
Ask HN: What’s the legality of web scraping? - malshe
I teach machine learning applications to masters students. Many students ask me whether it’s legally OK to scrape websites without using an API and use the data for their projects for my course. I usually just direct them to use APIs with authentication or use tabular datasets on Kaggle, data.world, etc., because I’m not a lawyer and I don’t know the legality of web scraping. The most relevant article I know is from EFF (https://www.eff.org/deeplinks/2018/04/scraping-just-automated-access-and-everyone-does-it) but it’s more than a year old.

Can anyone who knows the law please guide me on this issue? Note that the concern is less about what’s ethical and more about what’s legal. This will also help me in my research because these days some reviewers are raising this concern when they see authors used web scraped data. Online there are a ton of opinion pieces but nobody is clear on the legal side of it. Mostly people oppose scraping because they think it’s unethical.
======
ddebernardy
The current state of the art is hiQ vs LinkedIn:

[https://www.eff.org/cases/hiq-v-linkedin](https://www.eff.org/cases/hiq-v-
linkedin)

Basically: if it's publicly visible, you can scrape it.

Caveat: the case is still making its way to the Supreme Court.

Edit: There's also Sandvig v. Sessions, which establishes that scraping
publicly available data isn't a computer crime:

[https://www.eff.org/deeplinks/2018/04/dc-court-accessing-
pub...](https://www.eff.org/deeplinks/2018/04/dc-court-accessing-public-
information-not-computer-crime)

Edit2: Two extra common sense caveats:

\- Don't hammer the site you're scraping, which is to say don't make it look
like you're doing a denial of service attack.

\- Don't _sell_ or publish the data wholesale, _as is_ \-- that's basically
guaranteed to attract copyright infringement lawsuits. Consume it, transform
it, use it as training data, etc. instead.

~~~
isalmon
#3 respect robots.txt

~~~
foob
This is a polite thing to do, but I don't think that there is any legal
precedent for it being an actual requirement. Notably, both Apple and The
Wayback Machine publicly disregard robots.txt files [1]. I would be very
curious to read any court ruling that determined a robots.txt file needs to be
respected.

[1] - [https://intoli.com/blog/analyzing-one-million-robots-txt-
fil...](https://intoli.com/blog/analyzing-one-million-robots-txt-files/)

~~~
viraptor
Wayback machine does look at robots.txt - [https://help.archive.org/hc/en-
us/articles/360004651732-Usin...](https://help.archive.org/hc/en-
us/articles/360004651732-Using-The-Wayback-Machine)

~~~
foob
They look at them, but they don't follow them strictly [1]. They make
judgement calls on what they should do rather than treating robots.txt files
as a legal contract.

[1] - [https://blog.archive.org/2017/04/17/robots-txt-meant-for-
sea...](https://blog.archive.org/2017/04/17/robots-txt-meant-for-search-
engines-dont-work-well-for-web-archives/)

------
Dangeranger
You may want to review the court decision in the LinkedIn vs hiQ case[0][1].

> It is generally impermissible to enter into a private home without
> permission in any circumstances. By contrast, it is presumptively not
> trespassing to open the unlocked door of a business during daytime hours
> because "the shared understanding is that shop owners are normally open to
> potential customers." These norms, moreover govern not only the time of
> entry but the manner; entering a business through the back window might be a
> trespass even when entering through the door is not.

[0] [https://arstechnica.com/tech-policy/2017/08/court-rejects-
li...](https://arstechnica.com/tech-policy/2017/08/court-rejects-linkedin-
claim-that-unauthorized-scraping-is-hacking/)

[1]
[https://www.documentcloud.org/documents/3932131-2017-0814-Hi...](https://www.documentcloud.org/documents/3932131-2017-0814-Hiq-
Order.html)

~~~
malshe
Thanks. This is the case the EFF article I linked in the original post also
refers to.

------
userbinator
_whether it’s legally OK to scrape websites without using an API_

I'm not a lawyer either, but making such a frivolous distinction has always
bothered me --- HTTP(S) and HTML _is_ an API, and it's the one the web browser
uses. Maybe the "official" API offers some better formatting and such, but
ultimately you're just getting the same information from the same source. As
long as you don't hammer the server to the point that it becomes disruptive to
other users, as far as they're concerned you're just another user visiting the
site.

IMHO making such a distinction is harmful because it places an artificial
barrier to understanding how things actually work. I've had a coworker think
that it was impossible to automate retrieving information from a (company
internal) site "because it doesn't have an API". It usually takes asking them
"then how did _you_ get that information?" and a bit more discussion before
they finally realise.

"If you asked a hundred people to go to different pages on a site and tell you
what they found, is that legal?"

~~~
nostrademons
The distinction is usually based on implied consent. The general legal
principle is that if you own property and you grant consent for people to use
it for one purpose, they are free to use it for that purpose, but you haven't
necessarily granted consent for other purposes. Offering an API is a strong
indication that you actually intend to allow people to consume the data with
software, because otherwise you wouldn't have bothered. Offering an HTML
interface is usually an indication that you intend for people to consume the
data with a web browser.

Offering an HTML interface _may_ be an indication that you also consent to
allowing machines to read the data through the HTML - that's the idea behind
search engines. But that's where it gets complicated, and that's why there's
all sorts of other considerations to the legal question. Things like did you
include the pages in question in robots.txt, did you say anything explicitly
about scrapers in the ToS, does the scraper offer a way to contact its owner
about abuse, has the website actually contacted them, has an IP ban been
issued, is the scraping for commercial purposes, does it compete directly with
the site, does it interfere with legitimate human use, etc.

~~~
closeparen
When I buy a book, the publisher has no say in whether I use it for personal
enjoyment, or for class discussion, or to write a (potentially negative)
review, or to feed my fireplace. They have some control over wholesale
reproduction via copyright law, not arbitrary power to decide what I do with
it like, say, a restaurant that says I can only use my seat at the table to
eat their food for a reasonable amount of time.

Why would bytes on the wire be any different from printed words on the page
here?

~~~
nostrademons
There's a distinction between the physical pages of the book and the text
contained within the book. You _own_ the physical pages of the book; you can
use it as a paperweight, coaster, toilet paper, fireplace fuel, whatever.

You do not own the copyright on the _words_ of the book, and in many of the
cases you list, the publisher _does_ have a say in that. If you want to put on
a school play based on the book, you need to get permission from the author.
(My high school put on an in-house adaptation of _Out of the Dust_ , and we
had to write Karen Hesse and get her okay to do so.) If you put the entirety
of the book on your website so that readers of your negative review can refer
back to it, the publisher can come after you with a cease & desist or, if you
ignore it, a lawsuit. If you write fanfiction based on the characters in the
book, the publisher can come after you with a C&D. If you want to make a movie
based on that book, you need to buy the film rights. (There's currently an
interesting situation with _Game of Thrones_ where HBO owns the film rights to
the world of Westeros, but the film rights to the characters & story of Dunk &
Egg are still owned by GRRM, so if the film rights to the earlier Dunk & Egg
stories were ever bought by a studio other than HBO, they would have to be
scrubbed of mentions of Targaryens, the Iron Throne, King's Landing, etc.)

In the pre-Internet days, the chance of enforcement was next to nil for many
of these cases, because the big studios and publishers all got licenses for
any IP, while class discussions, high school plays, and hobbyists never got a
wide audience for their work and so the original publisher would probably
never know (unless you did something really stupid like send it to them). The
Internet's blurred a lot of these boundaries.

~~~
closeparen
Copyright forbids specific actions (reproduction), it doesn’t let the
publisher set arbitrary terms on my consumption and use of the text.

~~~
nostrademons
It's more that copyright defines certain rights (hence the name) that are
owned initially by the author of the work and then may be transferred or
granted to other parties for compensation. The exact rights specified are
defined by statute, and then case law provides specific precedent for what it
_means_. So again, consult a lawyer.

But for a concrete example - one of the exclusive rights bestowed by copyright
is the right of reproduction. (It's not the only one, BTW: performance is
another one specifically enumerated, as is distribution, as is creating derivative
works.) What does that mean? Well, courts have ruled that if you take an exact
digital copy of a work, as sold to the public, and publish it for free on a
torrent site, that's infringement. They've also ruled that there are various
"fair use" exceptions that give implicit rights to the general public even
when a work is under copyright. If you quote a sentence from a 300-page book
to support a point in an academic paper, that's not infringing.

Where's the boundary? Consult a lawyer, because there's lots of case law. I
remember that when I was at Google, there was a big debate over how big the
snippets (the little summaries of text on the results page) could be. 2
sentences was fine. A paragraph was dodgy. Showing the entire page was a big
no-no. Showing the entire page when the user clicks on "cached" was okay when
I was there (I don't remember what the justification was for that), but that
option has since disappeared, so I wonder if they ran into problems. They got
around it with AMP, which requires explicit opt-in from publishers and so has
an explicit consent.

It's not all that different from regular property rights in that regard. You
own land. What does that mean? Well, normally it means that you can build a
house on it - but not if you have a conservation easement on the land, or if
local zoning codes forbid the type of dwelling you want. It normally means you
have the right to keep other people off your land - except that if your
property completely surrounds somebody else's property and cuts them off from
a public street, you're required to grant them an easement so that they can
cross your land to get to their dwelling. There are other sorts of easements
you can grant, too, which are all ways of either granting other people some of
the rights associated with your property (but not all of them) or
_restricting_ yourself from having some of those rights.

------
mushufasa
The rule of thumb seems to be:

\- If the website offers the data publicly (without authentication), it's free
to scrape.

\- If the data isn't protected by copyright or trademark, (e.g. public data,
such as an address of a house), it's free to reuse.

\- If you use the data to compete with a big company, they will sue you
regardless.

Court rulings will vary depending on the court and judge.
[https://en.wikipedia.org/wiki/Web_scraping#Legal_issues](https://en.wikipedia.org/wiki/Web_scraping#Legal_issues)

~~~
skrebbel
> _If the data isn 't protected by copyright or trademark, (e.g. public data,
> such as an address of a house), it's free to reuse._

At the risk of stating the obvious, most data you'll find _is_ protected by
copyright. Eg this comment is written by me so according to nearly all
jurisdictions in the world, I own the copyright (unless HN has a clause that I
agreed to when I signed up that I sign it away, like stack overflow has).

Most forums, blogs, essays, articles, news sites, recipes and song lyrics are
covered by copyright. I'm pretty sure that a webshop's blurb about why product
x is good is covered by copyright.

~~~
toast0
Copyright protection on recipes is much more limited than the other creative
works you mentioned.

If you're scraping for more factual information, in some jurisdictions, such
as the US, there's a good chance those aren't subject to copyright. Things
like addresses, opening hours, prices, inventory (but not a description of the
inventory), etc can be very useful to scrape and present in different ways.

------
nostrademons
Complicated. Ask a lawyer. It depends a lot on the specifics of what you're
doing, and case-law makes a lot of very subtle distinctions based on exactly
who you're scraping, what their ToS says, how they present the ToS, how much
data you take, what you do with that data, is it public, is it facts & numbers
vs. opinion & expression, how much you might inconvenience their other users
and staff, whether you're a direct competitor of them, etc.

I suspect you'll actually get different answers depending on _which_ lawyer
you ask. If you've got deep enough pockets you can probably ensure you get the
answer you want, and if you have _really_ deep pockets you can probably ensure
the court gets the answer you want. But if you're just a student who doesn't
want to end up in court, there are potential minefields there.

~~~
mushufasa
if you're just a student doing research, the risk that you'll end up in court
is near zero regardless of other details. Any company would have to argue that
you are 'causing damages' in order to sue you. So your research would have to
be harming their servers, siphoning away their customers, or otherwise
materially harming the company.

------
PyroLagus
As an alternative to manual scraping, you can use CommonCrawl[0] or other open
data sets, such as those provided by AWS[1]. That should alleviate any legal
concerns (I think. I'm not a lawyer, but I'm sure CommonCrawl and Amazon have
lawyers), and it's considerably faster than scraping. On top of that, you
don't end up placing an unnecessary load on random websites.

[0] [https://commoncrawl.org/](https://commoncrawl.org/)

[1] [https://registry.opendata.aws/](https://registry.opendata.aws/)

~~~
malshe
Thanks. I did not know about CommonCrawl

------
ltbarcly3
IANAL, but I have done tons of web scraping over the years.

My tips:

\- Keep careful control of the rate you scrape. Every time I have ever heard
of someone getting negative feedback it is because they have scraped pages at
a rate that caused an impact on the website they were scraping. If you don't
cause a noticeable increase in traffic/load nobody will check to see what is
going on, and generally nobody has a reason to care.

\- Some sites are notoriously aggressive at going after people, such as
craigslist. I wouldn't try to scrape them.

\- Use some kind of proxy!
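
To make the first tip concrete, here is a minimal sketch of a rate limiter that enforces a minimum gap between requests. The class name `MinIntervalLimiter` and its parameters are hypothetical, made up for illustration; the right interval depends on the site, so treat any number you pick as a guess rather than a rule.

```python
import time


class MinIntervalLimiter:
    """Enforce a minimum gap between consecutive requests (illustrative only)."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self.min_interval_s = min_interval_s
        self._clock = clock      # injectable so the limiter can be tested
        self._sleep = sleep
        self._last = None        # time of the previous request, if any

    def wait(self):
        """Block until min_interval_s has passed since the previous call.

        Returns the number of seconds slept (0.0 if no wait was needed).
        """
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval_s - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

You would call `limiter.wait()` right before each HTTP request, whatever client library you use.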

~~~
userbinator
_\- Use some kind of proxy!_

Many proxies, in random order, would be the best.

That brings up another curious question: What's the legality of posting a site
to something like HN or Slashdot and effectively getting it DDoS'd...?

~~~
kyshoc
> posting a site to something like HN or Slashdot and effectively getting it
> DDoS'd

I imagine there's some reading of the CFAA that could _theoretically_ land you
in hot water for this, but this is silly.

Intent is very important. Can one sue or prosecute a popular food critic for
writing something about a restaurant, causing lines so long that long-time
regulars can't get a seat anymore?

On the other hand, you have things like booter services (essentially, DDoS as
a service). Continuing the analogy, I imagine if you hired 100 people to
physically block the entrance of a restaurant for some reason, you would be on
the hook for damages in civil court and something along the lines of
"disturbing the peace" in criminal court.

------
doh
This is a great question, one that is very important to our business [0] which
crawls many of the major social media platforms.

Andy Sellars [1] published a paper a year ago on the topic titled "Twenty Years
of Web Scraping and the Computer Fraud and Abuse Act" [2] which puts the topic
in great perspective. Many of the cases are not very clear cut and sway from
one direction to another. We are currently in an upswing where courts side with
the "crawlers", which may change in a couple of years.

[0] [https://pex.com](https://pex.com)

[1] [https://twitter.com/andy_sellars](https://twitter.com/andy_sellars)

[2]
[https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3221625](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3221625)

------
reefoctopus
Teach your students to ensure there’s a delay between requests so they aren’t
hammering anyone’s server, and follow the rules in the robots.txt. I’ve
scraped more than a billion pages without any issues.
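
For students, a rough sketch of what "follow the rules in the robots.txt" can look like with Python's standard `urllib.robotparser`. The helper names (`build_policy`, `allowed_urls`, `request_delay`), the `course-bot` user agent, and the one-second default delay are all assumptions for illustration. The robots.txt text is passed in as a string so the example runs without touching the network.

```python
from urllib.robotparser import RobotFileParser


def build_policy(robots_txt):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that robots.txt permits for this user agent."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def request_delay(parser, user_agent, default_s=1.0):
    """Use the site's Crawl-delay if it declares one, else a default (assumed) delay."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_s


if __name__ == "__main__":
    policy = build_policy("User-agent: *\nCrawl-delay: 5\nDisallow: /private/\n")
    urls = ["https://example.com/public", "https://example.com/private/x"]
    print(allowed_urls(policy, "course-bot", urls))
    print(request_delay(policy, "course-bot"))
```

In a real scraper you would fetch the site's actual robots.txt first and sleep for `request_delay(...)` seconds between requests.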

~~~
gshdg
Just because it’s technically feasible does not mean it’s legal or ethical.

~~~
reefoctopus
What makes it unethical?

Why should I be treated differently than search engine spiders?

If somebody doesn’t want their site scraped then they can let people know with
robots.txt. Get off your high horse.

~~~
patrickmcnamara
They never said it was unethical.

------
la_fayette
I teach software development in a data science master's degree. We learn about
web scraping, it is an important skill for a future data scientist IMHO, as
the web is the largest and most important dataset in the world.

Google is massively scraping the web and is building products on top of the
data, e.g. flight/hotel search. Why shouldn't we be allowed to do the same?

As others pointed out, one should take care about the ToS.

~~~
cookiecaper
> Google is massively scraping the web and is building products on top of the
> data, e.g. flight/hotel search. Why shouldn't we be allowed to do the same?

The short answer is because they have the muscle and you don't.

This was litigated in _Perfect 10 v. Amazon_ , and the only way out for
the panel deciding the case was to claim Google's use is "fair" due to its
unprecedented and "transformative" nature, which basically should be read as
"we don't want to face the public scorn of being the judges responsible for
shutting down Google Images". Such advantage is unlikely to be a factor in
less-prominent cases.

Even if you believe you can convince a panel of judges that your project
specifically meets the four-prong test for fair use, it takes millions of
dollars to litigate a case that far, which is well outside the realm of
possibility for most independent projects.

Flight and hotel search is a good example. Why can't you find Southwest fares
on any aggregator? People try to scrape them all the time, and as soon as they
come to Southwest's attention, they get C&D'd and shut down. This is a common
and well-established practice and hundreds of companies die by it every year.

~~~
la_fayette
OK. As I am not a lawyer, I cannot say anything about the US. In Germany there
are many meta search or aggregator services for various things. All of them
use web scraping extensively. E.g. a popular price comparison site:
[https://geizhals.de/](https://geizhals.de/).

The argument that the "transformative" nature of something makes it
exceptional and above the law does not sound intuitive to me.

------
avian
> the concern is less about what’s ethical and more about what’s legal.

Please reconsider this position. You're teaching the future generation of
engineers and scientists. Even if it's not strictly the topic of your course,
please don't teach your students that everything that's technically legal to
do is fine. Show that being socially conscious matters as well. Everybody will
be better off.

~~~
hartator
I completely disagree on 2 levels:

First, teachers shouldn’t be teaching morals, especially in college and
university. The slippery slope from morals to politics is a dangerous one.
I’d rather they focus on their actual course materials.

Second, there is nothing wrong with scraping from an ethical standpoint if you
don’t DDoS the target services. It gave us search engines. And that’s probably
one of the most important breakthroughs for humanity in the past few decades.

~~~
avian
The question obviously came up during their teaching, so it's become part of
the course, whether they want it or not. OP also says that their peers think
there are ethical questions in regard to scraping.

I don't see where you see politics in how they handle such questions. I'm not
advocating they go on an extended lecture about their personal views on the
political system that made the laws and what not.

I'm saying that there's a difference between handling these kinds of questions
with "if you're not sure, maybe you should kindly ask the publisher of the
data if they would be ok with you scraping/using it that way" and "if your
lawyer says you're in the clear, fuck them and scrape away."

~~~
hartator
The OP asked explicitly about legality, not ethics. So no, ethics didn't come up
organically.

------
gshdg
It may depend on how you use the data — for instance, publicly sharing what
you scraped is a clear copyright violation in many cases.

And in some cases scraping is a violation of ToS. (Though who knows whether
that’s ever been litigated as enforceable.)

~~~
malshe
Thanks for pointing out the ToS. Does the ToS apply even when someone is not
logged in to an account?

~~~
cookiecaper
It depends, but in general, yes, it can be made to apply with a small amount
of well-placed boilerplate language. It's called either "clickwrap" or
"linkwrap" depending on the way it's presented.

See Nguyen v. Barnes and Noble at
[https://en.wikipedia.org/wiki/Nguyen_v._Barnes_%26_Noble,_In...](https://en.wikipedia.org/wiki/Nguyen_v._Barnes_%26_Noble,_Inc.)
for a recent example that represented a _loosening_ of precedent by ruling
that the ToS was not enforceable because the user did not receive adequate
notice. If B&N had placed their disclaimer in a place where the user was more
likely to see it, they would've been fine.

------
strooper
Web crawling by search engines shouldn't be far from web scraping in terms of
data collection. I am wondering what is the legal boundary of web crawling for
search engines? While web scraping sounds sneaky, why isn't web crawling?

~~~
bytemode
You willingly submit links to a service to crawl your site, there's nothing
like "consent" for scraping...

~~~
nostrademons
You don't, actually, most sites are discovered organically through links on
other sites. Submitting links hasn't been common since the days of Yahoo and
DMOZ.

You're right that "consent" is the important legal issue, but it's usually
implied based on what your site requires re: authentication/authorization,
robots.txt, and the controls Google has provided to let you tell them not to
index a site.

------
danpalmer
I've raised this same point on a bunch of threads about scraping in the past,
but...

Scraping is fine if you ask the company and get permission!

This may seem obvious, but so many conversations about scraping seem to start
from the position that it is in some fundamental way, not allowed. This is not
true.

Conversations also seem to start from the assumption that you need to scrape
the whole web, which again is not true.

If you're teaching a machine learning course, perhaps you have a project on
classifying... cars. Do you need to scrape the whole web to get a bunch of
data about cars? No. Could you get away with scraping just Autotrader or a
similar site? Maybe! Why not ask them! If you clearly state that it's for
learning, that credit will be given, etc, you may find them quite amenable to
it.

I work at a company built significantly around web scraping, and we have
contracts with all of our scrape targets that confirm we are allowed to scrape
them.

------
unnouinceput
biggest scraper of the world? google. do they obey robots.txt? not a chance,
they really don't care. So do what google does, which is basically they run it
like they own the world and guess what? it's actually legal

~~~
pxtail
Funny thing is that it is legal for _them_ and same rules do not apply for
tiny plankton. So when it comes to scraping there is no choice but move in the
gray area, hope that you don't get caught (or that _they_ will notice too late
and your product will be noticeable and large enough that it will be allowed
to join big fish club)

------
bjourne
What jurisdiction are we talking about? Laws aren't the same everywhere.

~~~
OrgNet
is it illegal anywhere?

~~~
bjourne
Yes, in Sweden it is.

~~~
OrgNet
Let me put them on my blacklist...

------
anonu
I work in Fintech. One of our products is "alternative data"... Where we sell
financial datasets to other financial institutions, mainly hedge funds.

Typically the client will ask you to fill out a questionnaire about how you
create or generate the data. There are lots of questions about web scraping.

The general sense is that these firms are more and more sensitive to
purchasing data that has been scraped... Especially if it relates to
individuals or social media.

~~~
malshe
This is really interesting!

------
natch
I’ve seen projects where the company outsourced the scraping to contractors
with just vague instructions to “source the data.” That way the company is
insulated somewhat. Not entirely. And the ethical issue is still there. It’s
not an answer to your question, but this does tell you what some companies do
in practice to sidestep the issue. Might work for some research projects too.

------
stevage
There are two specific concerns:

1) Copyright 2) Terms of service.

If doing hobby/education projects and not publishing what you create,
copyright isn't really relevant.

As for violating terms of service (which is very likely), that's not
"illegal", it just opens you up to being sued. Which is very unlikely, if no
one is making money out of it, or hurting the service itself.

------
malshe
Hi all, I read all your comments [around 10 pm US ET]. Thanks a lot for taking
the time to share your knowledge and thoughts! I have a much better idea about the
legalities. This will help me immensely in my teaching as well as research. In
fact, this will also help my students in their current and future jobs.

------
hartator
Disclaimers: IANAL. And I run [https://serpapi.com](https://serpapi.com). I
can give free credits to your students for ML uses if you want.

Legality highly depends on where you are.

In the US, scraping of public data is a fair use exception protected by the
first. If you have to sign in to access the data, you then might be bound by
the ToS.

In Europe, scraping of public data can be against several laws. Notably GDPR,
the new copyright law, and you might be infringing copyrights on database as
defined by the CNIL.

~~~
malshe
Thanks a lot! We are based in the US. I will get in touch with you.

~~~
cookiecaper
His advice is incorrect. Scraping of public data is by no stretch of the
imagination "protected by the First".

------
lwansbrough
Sandvig v. Sessions

[https://www.eff.org/deeplinks/2018/04/dc-court-accessing-
pub...](https://www.eff.org/deeplinks/2018/04/dc-court-accessing-public-
information-not-computer-crime)

~~~
malshe
Thanks. I was unaware of this case.

------
shiado
The CFAA is arbitrarily enforced and it is impossible to know if you are safe
legally. People in this thread are saying that publicly accessible data is
safe to scrape but that certainly wasn't the case in United States v. Andrew
Auernheimer.

~~~
ddebernardy
Sandvig v. Sessions is more recent and says otherwise for publicly available
information:

[https://www.eff.org/deeplinks/2018/04/dc-court-accessing-
pub...](https://www.eff.org/deeplinks/2018/04/dc-court-accessing-public-
information-not-computer-crime)

~~~
userbinator
I think this is the most important sentence from that article:

 _does not make it a crime to access information in a manner that the website
doesn’t like if you are otherwise entitled to access that same information._

------
xtiansimon
Several comments suggest you ask a lawyer. AFAIK, a lawyer can’t answer the
question without doing the work, the same as a doctor can’t tell if you’re sick
without an examination.

I know this isn’t the answer you are seeking, but it might help you find more
examples: the area of copyright and fair use has a longer history with digital
images. Here’s an example court case showing, as others have noted, that the
ruling judge has great impact on the outcome: “Court Rules Images That Are
Found and Used From the Internet Are 'Fair Use'” By Jack Alexander, 2018-07-02
[1]

Maybe your educational institution has already done some legal work related to
issues of copyright and educational use?

Here is an example from a university where they have done the legal work and
constructed further guidelines to determine safe harbor limits [2]:

“The use of copyright protected images in student assignments and
presentations for university courses is covered by Copyright Act exceptions
for fair dealing and educational institution users. [...] In certain
circumstances you may be able to use more than a "short excerpt" (e.g. 10%) of
a work under fair dealing. SFU's Fair Dealing Policy sets out "safe harbour"
limits for working under fair dealing at SFU, but the Copyright Act does not
impose specific limits.”

[1]: [https://fstoppers.com/business/court-rules-images-are-
found-...](https://fstoppers.com/business/court-rules-images-are-found-and-
used-internet-are-fair-use-263567)

[2]: "I want to use another person's images and materials in my assignment or
class presentation. What am I able to do under copyright"
[https://www.lib.sfu.ca/help/academic-
integrity/copyright/stu...](https://www.lib.sfu.ca/help/academic-
integrity/copyright/students/student-images-assignment)

~~~
malshe
Thanks a lot for sharing these links. I will follow your advice and talk to
the university's legal folks about this.

~~~
xtiansimon
If your university is not already working on this, it seems you're ready to
raise the issue. Good luck. Please keep HN updated.

------
jaimex2
I'm puzzled by why anyone would think it's not OK to scrape. What you do with
the data is what matters; as long as it's transformed you're in the clear.

Scraping is the entire business model of every search engine.

------
IloveHN84
What's public, can be scraped.

Same applies to radio streams or Netflix videos: once they're streamed, you
can legally record the stream for yourself.

------
pp19dd
You can't copyright facts.

As journalists, we scrape things to collect information used toward
transformative analysis. Not straight-up mirroring. Facts, as stated by an
entity. So we've never run into a legal issue doing this as long as we used
the scrapes to synthesize results into data. For example, map of restaurant
closures by the health department, with statistics and graphs of violation
frequencies. Or analysis of lawyer performance by cross-referencing a state
judiciary database search with their team member lists for success rates and
other stuff.

Most of the sites we scraped were county, state or federal government sites
and they contained information available in the interest of the general
public. However, we crawled tons of private sites as well and as long as we
wore white hats we considered it fair game.

We typically tried to scrape things fairly without causing technical issues,
but to be honest, we ignored robots.txt directives all the time. We did time it
to happen during off hours, with backoff mechanisms in case we contributed
negatively to computational loads. The typical issues we ran into were
overeager system administrators who squashed or interfered with our scraping
attempts under their personal interpretation of appropriateness. Sometimes
they sicced misguided lawyers after us. Most of the lawyers couldn't tell you
what you were doing wrong, let alone how a site is registered, what a glue
record is, how DNS differs from IP registry ownership, how collocated servers
work and who owns them. They couldn't prove what we did with any of the data
to even imply we violated any copyrights.
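
The "off hours, with backoff" approach described above could look roughly like the sketch below. This is not the code we actually ran: the function names, the 1 am to 5 am window, and the backoff parameters are all hypothetical placeholders chosen for illustration.

```python
import random
from datetime import time as dtime


def backoff_delays(base_s=1.0, factor=2.0, max_s=300.0, retries=5,
                   jitter=0.0, rng=random.random):
    """Return exponentially growing wait times, capped at max_s.

    jitter adds up to that fraction of each delay at random, so that
    several scrapers do not retry in lockstep.
    """
    delays = []
    for attempt in range(retries):
        delay = min(max_s, base_s * factor ** attempt)
        delay += delay * jitter * rng()
        delays.append(delay)
    return delays


def is_off_hours(now, start=dtime(1, 0), end=dtime(5, 0)):
    """True if `now` (a datetime.time in the target site's local time)
    falls inside the assumed quiet window [start, end)."""
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight
```

A scraper would check `is_off_hours(...)` before starting a run, and on a slow or failed response sleep for the next value from `backoff_delays(...)` before retrying.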

So we relied on our legal departments to clear the way in case of issues, but
in 15 years of doing this, I've never once had a legal issue come up and put a
stop to what we were doing under that operating premise of transforming the
information into data. Our legal team never got involved for that sort of
thing. There were issues, but they got resolved through communication or by
reconfiguring our scrapers. Even when we've also made the raw data available
to the public or other researchers, it hasn't come up as a problem.

In one case, a police department figure blocked us because they disliked our
coverage. Their pretense was that our geocoding wasn't accurate enough from
the information they provided, and rather than circumvent their blocking we
had face to face meetings to address and mollify those concerns, on the
record. They ended up providing us with additional
information to meet that accuracy. In another case, the CEO of a large private
company personally threatened us legally claiming we violated their terms of
service for their API endpoint. However, their terms of service mentioned
nothing about data retention once something became a data point and we felt we
were in the clear so we kept doing it for years and nothing came of it.

