
Ask HN: How do you feel about web scraping? - jackschultz
Most sites seem to have ToS saying that scraping is not allowed, but in lots of cases that shouldn't be the case. If you see that, does it change your opinion?

Does the purpose of your scraping make a difference, if one use is just a project but another would be selling the data?

Are comments on sites like this public data or private?

What about sports statistics that sometimes are "private" by the league, rather than just open for people to use and write interesting articles about it?

Overall thoughts?

I actually created a site (I edited and deleted the mention of the name here because apparently people don't believe that I don't care about people looking at the site.) that scrapes comments and posts from Reddit that link to Amazon products and shows that information. I'm thinking of adding Amazon links from HN and other sites, but just not sure about how people would feel about scraping from sites like this.
======
throwaway6845
I built a simple CRUD app for a previous (small) employer. Nothing special
technology-wise, but a good concept, sound business model, and backed up with
a couple of full-time staff creating content for it. Line one of the T&Cs was
"no scraping". Business model was based on sales to individual users but we
were prepared to do analysis in aggregate if asked.

A scraper company, funded by magic money (Knight Foundation grants) and $1m of
VC, convinced a (UK) Government department to pay them to scrape our site for
some analysis the department wanted. They'd never contacted us, never asked
for permission, never asked if we could supply the data. Our company was
bumping along at this point and having to lay people off. Income from a nice
lucrative Government contract would have kept a couple more people in work.

The scraper company's FAQ was, in my view, full-on unethical:

> "we check the robots.txt file. If the site permits robots in general to
> scrape their site (NOT just GoogleBot!), then we will do so. We will make no
> effort to look for other terms and conditions as well."

You will ostentatiously "make no effort to look" for T&Cs in case they
prohibit the significant contract you're about to sign with the Government?
Whoa.

So how I feel about web scraping is simple: "don't be evil". If you're
diverting income or traffic from the original site, don't do it. If you're
genuinely adding value, go for it, but be open, be prepared to work with the
original site, and be prepared to accede to their wishes.

~~~
cheetos
Did your service offer a paid API? Scraping happens because of a lack of
better options. Surely you can understand why the scrapers didn't want to
contact you beforehand.

~~~
throwaway6845
In what world is "engage a third-party scraping company" a better option than
"drop a quick email to the site operator"?

~~~
binarymax
Because with scraping you are in a legal grey area. But if you contact the
site directly and they say "no", then there is no excuse to scrape.

~~~
throwaway6845
Well, yeah, like if you ask someone to sell you their dog and they say "no".
Doesn't justify stealing the dog.

~~~
binarymax
Your analogy doesn't hold up. Your example is clearly theft, and is a criminal
matter. Violation of a site's terms is a civil one, and again is a legal grey
area. Scraping the site doesn't delete the content from a server...but there
is only one copy of the dog.

EDIT - I can't reply to your comment below, but FWIW I agree that scraping
sites in this manner is unethical. I am merely describing the logic that most
scrapers go through for self justification and legal protection.

~~~
throwaway6845
The analogy was "doing something anyway because you might be told 'no'", not
"my server behaves like a canine quadruped". Copyright infringement is also
potentially criminal in the UK, it's not as simple as you suggest.

But whatever. It just saddens me that the internet is a constant "don't be a
dick" battle with companies like the scraper guys.

(edit - understood :) )

------
alkonaut
How would the ToS apply to me if I never read them? If I write a script that
downloads a page then I just downloaded it, never viewed it.

If you put something on a http endpoint then I can download it. The terms of
use of the material just regulates what I can do with it (republish it) but I
can't possibly be forbidden to download it.

Anyone with a web server is of course free to just block my attempt to
download their page, via a user-agent filter or any other method.
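
A minimal sketch of the kind of user-agent filter mentioned above, written as a
plain WSGI app in Python. The blocklist entries and the names `is_blocked` and
`app` are hypothetical, chosen only for illustration; in practice this would
more likely live in the web server or a WAF configuration.

```python
# Hypothetical blocklist: substrings identifying clients the operator refuses.
BLOCKED_AGENT_SUBSTRINGS = ("python-requests", "curl", "examplescraper")


def is_blocked(user_agent):
    """Return True if the User-Agent header matches the blocklist."""
    agent = (user_agent or "").lower()
    return any(token in agent for token in BLOCKED_AGENT_SUBSTRINGS)


def app(environ, start_response):
    """Tiny WSGI app: 403 for blocked agents, 200 for everyone else."""
    if is_blocked(environ.get("HTTP_USER_AGENT")):
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"Automated access is not permitted.\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello.\n"]
```

The User-Agent header is set by the client, so a filter like this only stops
scrapers that identify themselves honestly.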

Edit: IANAL so I have zero idea how law is actually applied around the world.
My view is just that "how on earth would it be possible to enforce a contract
I never saw?".

My view of course also says that I'm free to download the data at

[http://secret.somecompany.com:8081/secrets/data.bin](http://secret.somecompany.com:8081/secrets/data.bin)

Because they exposed it at a public http endpoint. I'm not so sure everyone
(including somecompany) would agree with that. I think it's still perfectly
reasonable that I'm allowed to do that - and that the restrictions that may
apply to material I reach that way is only related to how I _use_ it. Them
posting it publicly on the internet (even if it's not linked) is the same as
them having it on a billboard. I very much doubt that's how law works though.

~~~
SimplyUnknown
IANAL, but it came up in a computing science/internet law course and in the
Netherlands the current jurisprudence is that if you use (and continue to use)
a service you are reasonably expected to read, understand, and adhere to the
rules and conditions set by the service.

In this case it thus doesn't matter whether or not you even saw the contract.
You are bound by the ToS because it can be expected of you to search for it
during continued use of the service.

~~~
radmuzom
What happens if the TOS state that you are not allowed to use an ad-blocker
while viewing their website? (Honest question)

~~~
phn
Disclaimer: Pure conjecture.

I'd say the TOS have to apply to the "service" you're consuming and not the
website itself. Otherwise, what happens if you use just a text based browser?
Or any other means not foreseen by the TOS?

An ad-blocker would be just the same, it's something that your browser does.

~~~
manigandham
Intent matters. Ad blockers are not the same as text based browsers.

------
wslh
I think there is a contradiction between web scraping ToS and Internet
neutrality. Allowing a site to be scraped (which is no different from crawling
nowadays) only by Google but not by others violates that principle and
concentrates the power within a few companies.

~~~
spynxic
Allowing a site to only be scraped/crawled by professionals (i.e. Google)
ensures that those engaging such activity do so in a respectable manner.

It would be ideal to define the terms for scraping (similar to how API calls
work) as opposed to the acceptable authors of scrapers.

~~~
wslh
I am not sure you can define those terms in a formal way. You can crawl the
data in some way but use it for many different purposes that are difficult to
control.

~~~
ellimilial
Fancy extension to robots.txt?

------
DocG
We had a client who contacted us with an issue: the site went down every 15
minutes. After a couple of hours of debugging we found the culprit. Some
swedish equal rights server was "scraping" site every 15 minutes with massive
ddos. We put capacha for the server ip only (let them scratch their heads now)
and let it be. If you decide to scrape a site, don't bring it down or assume
it's a big server. Keep it civil and without impact on performance and nobody
minds.

~~~
akerro
Why have you not reported it to them and/or the police?

~~~
Geee
You don't need police in environments where you can defend yourself.

~~~
fao_
Ok, so for them it is no longer a problem, but what about other people? It is
their ethical duty to report them to the police, otherwise they were an
accessory in the crime (In some countries there are laws to enforce this)

~~~
tokenizerrr
If you make a shitty site that can't handle a few concurrent requests this is
your problem. It's not a DDoS. Not even a DoS. Just trying to request some
pages you make available, and you failing at doing so.

Or do you also plan on suing Google once they scrape your site?

~~~
fao_
Looks like somebody didn't read the parent.

> "scraping" site every 15 minutes with massive ddos

Google doesn't scrape every 15 minutes. It was implied that they scraped the
site often enough to cause a DDoS.

~~~
tokenizerrr
Looks like someone has never run a website with even a slight amount of
traffic. A few requests every 15 minutes is nothing and your site should be
able to handle it.

Additionally, it clearly wasn't a DDoS. Quoting from the parent:

> Some swedish equal rights server was "scraping" site every 15 minutes with
> massive ddos. We put capacha for the server ip only (let them scratch their
> heads now) and let it be

If there is only one server ip, it is by definition not distributed. Just a
misuse of the term DDoS.

------
franze
I consult a sh#tload of aggregators, all of them involving at least some kind
of web-scraping, most of the time quite a lot of the data-sources were not
asked directly for their approval.

The simple formula is: "Create more value than you take." \- read it as
"Create more value for those whom you take from than the value you actually
take from them."

If you scrape sport results from a sports page and create a competing product,
as#hole move, don't do it and you will get sued anyway.

Scrape sport results from a sport page, aggregate them into nice charts, but
for detailed inspection the user has to go to the source (which you link to
and want the user to use), you are good to go (from a product point of view,
legally i.a.n.a.l).

The internet is not a zero-sum-game. Building on top of each other, even on
data on top of each other, will lead to a better ecosystem, to more value for
every participant.

------
paulgb
In a couple places it sounds like you're interested in scraping HN -- if
that's the case there is an official API:
[https://github.com/HackerNews/API](https://github.com/HackerNews/API)
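
As a rough sketch of what using that API instead of scraping might look like in
Python: the base URL and the `item/<id>.json` endpoint come from the API's
README, while the helper names (`item_url`, `fetch_item`, `comment_ids`) are
made up here for illustration.

```python
import json
from urllib.request import urlopen

# Base URL as documented in the HackerNews/API README.
BASE_URL = "https://hacker-news.firebaseio.com/v0"


def item_url(item_id):
    """Build the JSON endpoint URL for a story, comment, or poll."""
    return f"{BASE_URL}/item/{int(item_id)}.json"


def fetch_item(item_id, timeout=10):
    """Fetch one item and decode it into a dict (performs a network request)."""
    with urlopen(item_url(item_id), timeout=timeout) as response:
        return json.load(response)


def comment_ids(item):
    """Return the IDs of an item's direct replies; 'kids' is absent if none."""
    return item.get("kids", [])
```

Walking a thread is then a matter of calling `fetch_item` on each ID returned
by `comment_ids`, ideally with a pause between requests.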

My own take on it in general is that for personal/research use I'm not morally
opposed to scraping, even when it's in violation of the ToS, with two
conditions: that it doesn't place an unreasonable burden on the server, and
that it doesn't invade people's privacy. The legal significance of the ToS is
murky at best (disclaimer: I'm not a lawyer) but if the site asks you
specifically to stop scraping them or puts up a technical barrier you should
stop (morally and, in the US at least, legally: see craigslist v 3taps)

~~~
bigiain
I'm pretty much on this wavelength. If I can automate something that I'd
otherwise do manually - or that a less technical person would happily do
manually, I'm fine with having python or perl and a cron job do it. What I try
to avoid is scope creep when my automation makes it easier to magnify the
scale. My dad used to collect his local synoptic chart from the weather bureau
website - and it frustrated him that they only archived 1 week's worth so when
he was away he'd need to ask someone else to download them for him or end up
with gaps in his collection - I happily scripted that up for him. What I
wouldn't do for him though, was make it grab _every_ chart for every region
available - if he wasn't doing that by hand, I wasn't going to break the terms
of service that much further and automate it for him... I'm happy to admit
that when I do things like that, they're without doubt technically infringing
on someone's legal rights, but ethically I consider this to be in the "of
course I'll walk across an intersection against a red light in the middle of
the night with no cars around" class of wrongdoing.

------
pedalpete
IANAL, but I'd suggest that it doesn't matter how people "feel", there is a
legal element here. For example, you can scrape NASDAQ data from any number of
sites, but those sites pay NASDAQ for the data feed. You do not have the right
to use that data. You need to get permission from NASDAQ. (I'm just using them
as an example.)

You don't really get to decide how somebody else's data gets used.

Using your sports statistics example, this will become a grey area as writing
becomes more automated, but at the moment, a writer gets a 'statistic' like a
score which is made publicly available. There are no limits on using that
statistic. But you didn't automate the process of spreading the stats; you, in
theory, found a fact and wrote about it.

This is different from just giving a feed of stats, or linking through a bunch
of services.

~~~
jdc
Of course this goes without saying that the law depends on the jurisdiction
where the scraping takes place.

------
danpalmer
I work in a company that does a lot of web scraping, mostly from companies
whose ToS says no scraping.

Instead of violating the ToS we have business contracts with each company,
that give us the permission to scrape. We use this as a way to take control of
integrations, and put the ball in our court, as most of these companies have
little to no technical expertise or resource. By doing this we can create an
integration as quickly as we want, instead of waiting months or even never
managing to get one if it were to be done through an API.

Scraping can be a powerful tool in this respect, make sure you have
permission, but ToS saying no scraping doesn't necessarily mean you can't get
special permission.

~~~
nurettin
A customer asked us to integrate their website with our mobile application.
They had no IT department to do the integration, and they didn't want to open
their database to the web, so I came up with this solution;

ssh into the machine, select using a db commandline tool, return data back
from stdout in csv format, parse data at our side. It's surprisingly fast and
secure.
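
A minimal sketch of that approach in Python, assuming the database tool is the
`sqlite3` command-line shell with its `-csv` and `-header` options. The host,
database path, query, and function names are placeholders, not details from
the actual integration.

```python
import csv
import io
import shlex
import subprocess


def build_command(host, db_path, query):
    """Build the ssh command line; host, db_path and query are caller-supplied."""
    remote = "sqlite3 -csv -header {} {}".format(
        shlex.quote(db_path), shlex.quote(query)
    )
    return ["ssh", host, remote]


def parse_csv(text):
    """Turn the CSV printed on stdout into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def query_remote(host, db_path, query, timeout=30):
    """Run the query on the remote machine and return the parsed rows."""
    result = subprocess.run(
        build_command(host, db_path, query),
        capture_output=True, text=True, check=True, timeout=timeout,
    )
    return parse_csv(result.stdout)
```

Quoting the path and query with `shlex.quote` matters because ssh hands the
remote command to a shell on the other side.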

~~~
caleblloyd
SSH tunneling supports forwarding remote hosts to local ports. Then you could
use a proper database connector in your app. For example access MySQL through
an SSH tunnel: [https://stackoverflow.com/questions/18373366/mysql-
connectio...](https://stackoverflow.com/questions/18373366/mysql-connection-
over-ssh-tunnel-how-to-specify-other-mysql-server)

~~~
nurettin
I forgot to mention that it was an SQLite database.

------
atemerev
If you can see it, you can scrape it. It is important to be nice and not
overload the site's systems, but personally I don't see a difference if I view
the site myself or some script is doing it for me for later consumption.

Technically, sites can do anything they want to make scraping more difficult.
But from the moral (and, I hope, legal — at least in the future) standpoint,
scraping should be your right.

~~~
IncRnd
There is clearly a difference between scraping and not scraping, or you
wouldn't want to scrape a site.

Why _should_ scraping be your right, morally? Your convenience doesn't remove
ownership of data.

~~~
atemerev
If the data is already published, it is effectively made public.

------
lazyjones
* most sites prohibit scraping but beg for Google to scrape them

* many good products/websites are based on scraped content

* many good products/websites are not feasible because of scraping limitations and limited access to data, even publicly funded data (e.g. no real estate ads with noise overlay despite the EU mandating noise maps in all member states; member states have prohibitive access rules for these maps)

* "rogue" scraping causes problems for many websites; blocking it creates problems for legitimate users of various proxy/anonymization services like Tor. Captchas are not a long-term solution, good programming defeats them.

I'd welcome a simple technical solution for scraping that takes into account
the interests of site owners, other publishers, the public. The sooner
publishers get together and build one, the better for them.

~~~
disiplus
google at this point is "phonebook" of the internet.

it is understandable that you want to be in phonebook but also that you dont
want somebody random to enter, use your resources and copy everything because
of some other interests.

if there would be some official list of all "phonebooks" and you would allow
those to scan your site but block others i would do it.

scrapers from my side, spam my logs/analytics, fake real visits, try to spend
my adword budget, some of them sign up with spam data and screw your
conversions and testing and so on. if it was only search engine i would allow
them but its not.

at this point i have the whole aws ip range blocked.

~~~
lazyjones
> _it is understandable that you want to be in phonebook but also that you
> dont want somebody random to enter, use your resources and copy everything
> because of some other interests._

It's not that simple: Google has used scraped content in the past to build
competing products to the sites it was from. Examples: reviews from shopping
sites for Google Shopping, Yelp reviews for Google Places.

Google can do it because it basically owns your front door. Everyone else
can't.

> _at this point i have the whole aws ip range blocked._

From past experience: this will likely block some apps, possibly services like
Alexa from accessing your site.

------
tomc1985
I laugh at people who try to exercise ownership of bits. It's foolish to place
data out there in the open with the expectation that people are going to treat
it nicely.

Data yearns to be free, stop fighting it!

------
tyingq
Reddit addresses this specifically in their TOS and prohibits it. They mention
a licensing program.

[https://www.reddit.com/help/useragreement/](https://www.reddit.com/help/useragreement/)

Doesn't really matter how we all feel about it. If you hit their radar,
they'll go after you. Can be expensive, whether you're in the right or not.

You're more likely to hit their radar if you're trying to make money. I
suspect you're using affiliate links, right?

~~~
jackschultz
Right, I've come across that page, but I include my name and contact in the
header, and don't make more than a request every 15 seconds so it won't hurt
their servers at all.
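
Roughly what that looks like in Python, as a sketch only: the class name, the
choice of the `User-Agent` header to carry the contact details, and the
example contact string are assumptions for illustration, not the code behind
the site.

```python
import time
from urllib.request import Request, urlopen


class PoliteFetcher:
    """Fetch pages with an identifying User-Agent and a minimum delay between requests."""

    def __init__(self, user_agent, min_interval=15.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.user_agent = user_agent
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_request = None

    def wait_time(self):
        """Seconds still to wait before the next request is allowed."""
        if self._last_request is None:
            return 0.0
        elapsed = self._clock() - self._last_request
        return max(0.0, self.min_interval - elapsed)

    def build_request(self, url):
        """Attach the contact details so the site operator can reach the author."""
        return Request(url, headers={"User-Agent": self.user_agent})

    def fetch(self, url, timeout=10):
        """Sleep if needed, then perform the request and return the body bytes."""
        delay = self.wait_time()
        if delay > 0:
            self._sleep(delay)
        self._last_request = self._clock()
        with urlopen(self.build_request(url), timeout=timeout) as response:
            return response.read()
```

A fetcher would be created once, for example
`PoliteFetcher("example-bot/0.1 (contact: me@example.com)")`, and reused for
every request so the 15 second spacing applies across the whole run.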

And I did have affiliation links initially where I included the non-affiliate
links next to them if people didn't want the affiliation. But then I got rid
of those cause people didn't like it. I don't care about making money on it,
just think it's interesting.

~~~
tyingq
I suspect they are more interested in their copyrighted content than load on
their servers. Hence the licensing program.

If you aren't trying to monetize, you will be lower on their radar, but not
immune.

~~~
__jal
Whose copyright?

~~~
tyingq
Reddit's.

 _" By submitting user content to reddit, you grant us a royalty-free,
perpetual, irrevocable, non-exclusive, unrestricted, worldwide license to
reproduce, prepare derivative works, distribute copies, perform, or publicly
display your user content in any medium and for any purpose, including
commercial purposes, and to authorize others to do so."_

The original comment submitters could grant you a license on their own, but
that might be difficult to coordinate.

~~~
LamaOfRuin
To be clear then, reddit has no apparent copyright (on individual comments
anyway). They have a license to the copyrighted work, which they're allowed to
re-license.

I view this as significant mainly because it changes what they could sue you
for, and any fair use evaluation if copyright does enter into it at some
point.

~~~
bigiain
And they specifically mention it's a "non-exclusive" license.

I suspect Reddit might claim a Collection Copyright in the compilation of
posts/comments - so even if you could acquire your own license to a post and
each of the comments - if you tried publishing that as a book, Reddit could
claim the organisation of those individually copyrighted items is owned by
them...

------
boomlinde
I have nothing against scraping. Ideally, all information on the web is ready
for consumption by machines and people alike, and any law or contract trying
to address machines and people differently in this regard is going to be
flawed and technically ambiguous. Potential traffic issues aside, this seems
largely unproblematic to me.

It's what you actually do with the information that matters. For example,
republishing or otherwise distributing information when you have no right to
do so may be an ethical and legal issue.

------
thinkMOAR
I think this might be an interesting read for webscrapers,

[https://en.wikipedia.org/wiki/Sui_generis_database_right](https://en.wikipedia.org/wiki/Sui_generis_database_right)

------
ianamartin
The last time I wrote a generalized web scraper (as opposed to something
specific that we had permission for) I put a lot of effort into distributing
the load across many websites so that no single site would feel any pain more
than if you were just browsing as a normal human being.

We were, at the time, scraping for lead information to add to our marketing
database, and this isn't exactly the thing I'm most proud of in my career.
But we all make mistakes. I wouldn't do that again.

At the same time, we rotated things so that we weren't killing the websites in
the niche market that we were trying to scrape for leads.

The algorithm was that we would seed Google, Yahoo, and Bing with certain
keyword searches that were relevant. Then we'd take the search results from
the APIs and stuff them into an array. Then we would sort them proportionally.
If we (like we did) most often get the most hits from google, followed by
yahoo, and then by Bing, we'd stuff the results into an array and intersperse
them.

So if we had 3x google results and 2x yahoo, and 1x bing, we wouldn't hit the
google results first. We'd hit a google result 3x then a yahoo 2x, then a bing
1x and cycle.
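
One way to read that cycle, sketched in Python. This is a reconstruction from
the description above, not the original code: the function name is invented,
and rounding each source's share relative to the smallest source is an
assumption about how uneven counts would be handled.

```python
def interleave_proportionally(sources):
    """Cycle through the sources, taking from each in proportion to its size.

    `sources` maps a source name to its list of results. With sizes in a
    3:2:1 ratio, each cycle takes 3 items from the largest source, 2 from
    the next and 1 from the smallest.
    """
    queues = {name: list(items) for name, items in sources.items() if items}
    if not queues:
        return []
    smallest = min(len(items) for items in queues.values())
    # Largest source first, matching the google/yahoo/bing order described.
    order = sorted(queues, key=lambda name: len(queues[name]), reverse=True)
    per_cycle = {
        name: max(1, round(len(queues[name]) / smallest)) for name in order
    }

    merged = []
    while any(queues.values()):
        for name in order:
            take = per_cycle[name]
            merged.extend(queues[name][:take])
            del queues[name][:take]
    return merged
```

With six Google results, four Yahoo results and two Bing results, each pass
takes three, two and one respectively, so the list is exhausted in two cycles.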

It was a decent way of doing things.

We never broke anyone's stuff. Even if it should have been.

~~~
AznHisoka
i dont understand why you wouldnt be proud of this.. what exactly did you do
wrong?

~~~
ianamartin
I feel uncomfortable about what we were doing in general. I didn't feel bad
about the implementation. We were using the results of the web scraper to feed
what was basically spam emails trying to recruit people to sign up for our
product.

In the grand scheme of things, I think that I did the most responsible thing I
could have with the task I was assigned. But the task falls into a general
category of things I don't approve of.

It certainly wasn't illegal, and it probably wasn't unethical, but it was
definitely gross, and my internal standards tell me to avoid things that make
me feel gross.

That said, I'm kind of pleased with the technical results. Up until that time
in my career, I'd never encountered a sorting algorithm that handled things
based on the proportion of similar items in an array. I'm sure that other
people have done this and that there's no way it's novel at all. But it was a
cool challenge to make a shady task perform in a way that didn't break other
people's shit just so that we could maximize our own efficiency.

------
jawns
It sounds like what you're really asking is this ...

There are some websites whose primary business model is providing content in
exchange for something: a subscription fee, or advertising eyeballs. They have
a very strong financial interest in your not scraping their content and
providing it to others on different terms.

There are other websites who make some content available and explicitly
authorize people to use it: various datasets and RSS feeds and such.

And then there is a wide swath of websites that have adopted generic TOS that
prohibit scraping, or they prohibit it because they haven't given it much
thought and can't think of any particular reason off the top of their heads to
permit it.

So what you _really_ want to know is what sites in the third category would
consider a sensible scraping policy, if they had to give it sufficient
thought.

In other words, if they don't just default to a prohibition because it's
already in a TOS template or because they haven't thought it through, what's
the rationale for either blocking or not blocking scrapers?

------
aub3bhat
The legal system does not "decide" on legality unless it's forced to.
Consider the case of the Google Books project. Eventually the courts did rule
that it constituted fair use. Or consider the situation involving Flickr and
Pinterest [1], or the one involving RapGenius and lyrics licensing [2].

So to answer your questions:

    
    
> Are comments on sites like this public data or private?

> Does the purpose of your scraping make a difference, if one use is just a project but another would be selling the data?
    

There is no correct answer, at least unless you are willing to wait a decade
and spend millions while the cases make their way through the byzantine legal
system of districts, circuits, appeals and supremes. Unlike Science &
Technology, where there is a "correct" answer, you should approach the legal
system with a different perspective, using instincts and an acceptable risk
tolerance.

History is littered with people who took a bet, and ended up succeeding or
failing upwards.

Finally ignorance is actually preferable to knowledge. By writing this
question or say having this conversation over an email you are simply creating
a paper trail that can only harm you if you get sued tomorrow. [3]

[1] [https://photo.stackexchange.com/questions/53304/what-is-
the-...](https://photo.stackexchange.com/questions/53304/what-is-the-legal-
consensus-on-pinning-photos-using-pinterest)

[2] [https://www.nytimes.com/2014/05/07/business/media/rap-
genius...](https://www.nytimes.com/2014/05/07/business/media/rap-genius-
website-agrees-to-license-with-music-publishers.html)

[3] [https://www.fastcompany.com/1588353/steal-it-and-other-
inter...](https://www.fastcompany.com/1588353/steal-it-and-other-internal-
youtube-emails-viacoms-copyright-suit)

------
hanoz
If it's done in a way which causes no greater load on the servers than a human
doing the same job, and especially if it's only for personal use, then I for
one feel entirely comfortable about it. And I would give pretty short shrift
to any robots.txt rules favouring Google alone, which are clearly morally
unreasonable and in my non-professional opinion legally questionable too.

------
TeMPOraL
I scrape stuff for personal use (mostly to generate RSS feeds for things like
newly added files on an FTP server, or new search results for a particular
query in a classified ads service). In those use cases, I don't care what the
TOS says. I'm just automating _my_ browsing, I'm not exfiltrating data I
shouldn't have access to, nor am I republishing it anywhere else.

But in general, I'm sympathetic to all scraping efforts that provide benefit
to people for free.

------
WhiteSource1
There are many legitimate reasons for web scraping and I think it’s fair that
you set out your desires for scraping in your robots.txt file. You can block
other bots while allowing Googlebot. Alternatively, you can also use a WAF
that restricts bot access.
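
A small sketch of such a robots.txt, checked with Python's standard
`urllib.robotparser`. The file contents, the `ExampleScraper` agent name and
the `allowed` helper are illustrative assumptions, and robots.txt remains
advisory: it only affects crawlers that choose to honour it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative robots.txt: Googlebot may crawl everything, all other bots nothing.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here `allowed("Googlebot", "https://example.com/page")` is true, while the same
call with `"ExampleScraper"` is false.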

But I do have a problem with scraping other sites to divert traffic away from
them. It is perhaps acceptable if it’s for data analysis. Also, I would be
very careful, as some governments have very strict privacy laws (the EU, for
example) and you never know what you are scraping.

But one of the other problems is that web scrapers can be used for very
nefarious purposes – see this article on web scraping attacks:
[https://www.incapsula.com/web-application-security/web-
scrap...](https://www.incapsula.com/web-application-security/web-scraping-
attack.html)

------
vbezhenar
If something is accessible from the Internet, it's public. Read it or scrape
it, doesn't matter. API is only to reduce burden for scrapers and servers.

Though legally it might be punished, so you'd better not reveal yourself.

~~~
hueving
If it can be legally punished, I don't think it falls under a normal
definition of public, does it?

My bank account can be accessed from the internet, is it public? Or do you
mean without logging in? In that case, are quora answers public?

------
theshrike79
If you don't provide an API or your API is either ridiculously expensive to
use or behind some arcane vetting process ( _cough_ Instagram _cough_ ), don't
be surprised if people scrape instead of using it.

And if you scrape, just don't be a dick about it. Don't hammer the site with
your buggy scraper doing 1k hits per second. And don't resell the data.

------
user5994461
>>> What about sports statistics that sometimes are "private" by the league,
rather than just open for people to use and write interesting articles about
it?

Not happening. Sports data is far too valuable. High sale value.

Don't give away for free what you could charge a lot of money for.

------
onli
There is no such thing as public data that is private. If it is on the web for
public consumption, the site has no right to forbid working with the data and
re-using it elsewhere. There just is no means for them.

That does not say that one has the right to publish the exact same content.
Then it becomes a copyright issue. But remixing it into another site, like
yours? You have every right.

Note: That might depend on where you live. Legislation might differ; US fair
use is a different concept than Germany's public data concept, for example.

~~~
pbhjpbhj
Copyright doesn't cover factual content only the specific presentation of the
content if it's distinctive/artistic enough.

In Europe we have IPR covering data though.

------
orionblastar
I love scraping and hate it too. I have limited experience with it, but some
clients want you to web scrape competitors sites to steal their customer list
etc. It is a matter of ethics. I tried C# and a few other languages and it
didn't work because some Javascript on the site had anti-scraping script. I
declined the offer, as I didn't want to get sued by their competitor they
wanted me to web scrape.

Web scraping is going the way webbots work now. Have to scrape data to
automate stuff.

------
shaunpud
If a site provides a sitemapindex in the robots.txt, would you assume they're
allowing their site to be crawled on the pages they've set out within?

------
nandemo
This doesn't address your question directly, but are you aware that Reddit
provides an API? Why not use it instead of scraping?

------
drivingmenuts
"Web scraping" covers a pretty broad area from plain data collection to
snagging content for republishing.

Can you be a bit more specific?

~~~
jackschultz
Sure, I'm just curious about people's opinions on whether or not it's
necessary to follow site's rules on data scraping, whether it's always ok if
you're not trying to make money on it, or if something like giving credit to
where the data comes from makes a difference. Just things in general like that
if that makes sense.

------
sparkling
Our opinion does very little. You would have to go over each site scraped and
read their terms of service / agreements.

~~~
ungzd
Google does not ask every site owner permission to index it.

~~~
tyingq
There is some quid pro quo with allowing Google to scrape though. You get
traffic in return.

This is changing, though, with Google's slow but steady movement of organic
search results down the fold.

------
sharemywin
Fair use angle:

[http://digitalcommons.law.yale.edu/cgi/viewcontent.cgi?artic...](http://digitalcommons.law.yale.edu/cgi/viewcontent.cgi?article=1107&context=yjolt)

------
AznHisoka
Your question is more suitable for the owners of those sites, not us. After
all, it isn't us who would be suing you in court. What we say has little to no
impact.

~~~
jackschultz
Oh I totally get that, which is why I make sure any of the scraping I do isn't
to make money, just to write about interesting analysis. But I'm just also
curious about people here's opinions on legality and whether they care about
following the specific site's rules.

~~~
IncRnd
Where I live, if there is no exchange of money for goods then there is not a
contract. If there is fair use of the material, then copyright doesn't apply.

------
scaryclam
It depends. What's the site? How are you scraping? Is it going to cause
traffic issues on the server? Are you breaking terms of service or any other
reasonable requests to not scrape? Did you even talk to the site owners? Are
you scraping content behind a paywall or similar? All of these questions and
others make up the answer to your question, so there's no generalisation to
make really.

------
disiplus
the same i feel about torrenting. if you do it for your needs im ok with it
but if you do it with commercial interest one way or another im not ok with
it.

------
hodder
fantastic. Web scraping has enabled me to access a ton of data I would
otherwise be forced to access manually.

------
artilect
This is the reason the robots.txt was created, to tell web scrapers and people
building them, what is off limits.

Of course there are people building services that scrape certain sites that
appear to be off limits to you.

Those people scraping sites that are explicitly prohibited either:

a. are breaking the rules, potentially the law if it's explicitly prohibited
in a ToS, and will eventually have to deal with getting banned, or sued. It's
quite a gray area legally but here are some laws that could be used against
you:

Violation of the Computer Fraud and Abuse Act (CFAA). Violation of California
Penal Code. Violation of the Digital Millennium Copyright Act (DMCA). Breach
of contract. Trespass. Misappropriation. Source: Linkedin v. Doe Defendants

b. have an agreement with the website owners allowing them to scrape certain
portions of their site.

c. are scraping data with no rules concerning it.

For example, Facebook has a ToS for scraping:
[https://www.facebook.com/apps/site_scraping_tos_terms.php](https://www.facebook.com/apps/site_scraping_tos_terms.php)
At the bottom there is a form for those that want to get permission to scrape
the site. And their robots.txt is heavily used to control crawlers with User-
Agents they know.
[http://facebook.com/robots.txt](http://facebook.com/robots.txt)

It's rare you would run into legal issues, but possible. The question is
whether it's morally okay for you to scrape any data you want.

------
hsod
To examine the ethics of web scraping, I think it's useful to strip away all
the technical trappings and just look at it as an interaction between two
people. We'll call them Chloe the Creator and Sam the Scraper.

Chloe has expended effort/energy/resources to gather and collate information
which provides value in some way to Sam.

Sam has invested nothing, risked nothing, and expended no effort with regard
to that information.

Chloe decides, in her own interest, to give the public access to the
information.

Does Chloe have a moral right to try and impose conditions on that access?

Does Sam have a moral duty to make a good-faith effort to abide by those
conditions?

My answer to both of these questions is yes.

------
wcummings
Web scraping is my birthright, no ToS on this earth can take it away.

------
devopsproject
> I actually created a site I called blah blah blah (don't worry, this isn't
> an ad for it)

[https://media.giphy.com/media/EouEzI5bBR8uk/giphy.gif](https://media.giphy.com/media/EouEzI5bBR8uk/giphy.gif)

~~~
foxbarrington
Why don't you believe OP? If OP wanted to advertise the site, wouldn't they
just do a Show HN? Is this more effective to warrant being dishonest?

~~~
devopsproject
because it would have been a lot easier to omit the entire paragraph.

