
Ask HN: Best practices for ethical web scraping? - aspyct
Hello HN!

As part of my learning in data science, I need/want to gather data. One relatively easy way to do that is web scraping.

However I'd like to do that in a respectful way. Here are three things I can think of:

1. Identify my bot with a user agent/info URL, and provide a way to contact me.
2. Don't DoS websites with tons of requests.
3. Respect the robots.txt.

What else would be considered good practice when it comes to web scraping?
======
snidane
When scraping, just behave so as not to piss off the site owner - whatever that
means. E.g. don't cause excessive load, and make sure you don't leak out
sensitive data.

Next, put yourself in their shoes and realize they usually don't monitor their
traffic that much, or simply don't care as long as you don't slow down their
site. It's usually only certain big sites with heavy bot traffic, such as
LinkedIn or sneaker shoe sites, which implement bot protections. Most others
don't care.

Some websites are created almost as if they want to be scraped. The JSON API
used by the frontend is ridiculously clean and accessible. Perhaps they benefit
when people see their results and invest in their stock. You never fully know
whether a site wants to be scraped or not.

The reality of the scraping industry, as it relates to your question, is this:

1\. Scraping companies generally don't use a real user agent such as 'my
friendly data science bot'; they hide behind a set of fake ones and/or
route the traffic through a proxy network. You don't want to get banned so
stupidly easily by revealing your user agent when you know your competitors
don't reveal theirs.

2\. This one is obvious. The general rule is to scrape over a long time period,
continuously, and add large delays between requests of at least 1 second
(a minimal sketch follows below point 3). If you go below 1 second, be careful.

3\. robots.txt is controversial and doesn't serve its original purpose. It
should be renamed to google_instructions.txt, because site owners use it to
guide googlebot around their site. It is generally ignored by the industry,
again because you know your competitors ignore it.

Just remember the rule of 'don't piss off the site owner' and then just go
ahead and scrape. Also keep in mind that you are in a free country and we
don't discriminate here, whether for racial or gender reasons or for being
a biological or mechanical website visitor.

I have simply described the reality of the data science industry around
scraping after several years of being in it. Note that this will probably not
be liked by the HN audience, as they are mostly website devs and site owners.

~~~
wizzwizz4
1\. is the only one I don't like. I think you should use your real user agent
first on any given site, as a courtesy; whether you give up or change to a
more "normal" user agent if you get banned is up to you.

Oh, and for 3.: if you can, apply some heuristics to your reading of the
robots.txt. If it's just "deny everything", then ignore it, but you really
don't want to be responsible for crawling all of the GET /delete/:id pages of
a badly-designed site… (those should definitely be POST, and authenticated, by
the way).
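
For what it's worth, a rough sketch of that heuristic using Python's
standard-library robots.txt parser (the site and bot name are hypothetical):

```python
import urllib.robotparser

BASE = "https://example.com"   # hypothetical site
MY_AGENT = "my-learning-bot"   # hypothetical bot name

rp = urllib.robotparser.RobotFileParser()
rp.set_url(BASE + "/robots.txt")
rp.read()

def allowed(url):
    # Crude heuristic: if even the root is disallowed, treat robots.txt as a
    # blanket "deny everything" and (per the judgement call above) ignore it;
    # otherwise honour the individual rules.
    if not rp.can_fetch(MY_AGENT, BASE + "/"):
        return True
    return rp.can_fetch(MY_AGENT, url)

print(allowed(BASE + "/delete/123"))
```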

~~~
chatmasta
I disagree. The risks are similar to those of disclosing a security
vulnerability to a company without a bug bounty. You cannot know how litigious
or technically illiterate the company will be. What if they decide you're
"hacking" them and call the FBI with the helpful information you included in
your user agent? Crazier things have happened.

Anonymity is part of the right to privacy; IMO, such a right should extend to
bots as well. There should be no shame in anonymously accessing a website,
whether via automated means or otherwise.

~~~
a1369209993
> such a right should extend to bots as well

No, it very much shouldn't, but (as you probably meant) it _should_ extend to
the _person_ (not, eg, company) _using_ a bot, which amounts to the same thing
in this case.

------
pfarrell
It won’t help you learn to write a scraper, but using the Common Crawl dataset
will get you access to a crazy amount of data without paying to acquire it
yourself.

[https://commoncrawl.org/the-data/](https://commoncrawl.org/the-data/)

~~~
aspyct
Cool, didn't know about this. Thanks!

~~~
Reelin
> As part of my learning in data science, I need/want to gather data.

Also not web scraping, but a few other public data set sources to check.

[https://registry.opendata.aws](https://registry.opendata.aws)

[https://github.com/awesomedata/awesome-public-datasets](https://github.com/awesomedata/awesome-public-datasets)

~~~
aspyct
Thanks!

~~~
smcnally
also [https://www.reddit.com/r/datasets](https://www.reddit.com/r/datasets)

~~~
aspyct
The comments that keep on giving :D

~~~
analyticascent
I appreciate you starting this thread; these are great resources people are
posting.

Common Crawl is _the_ data set to master if someone wants to use the fruits of
web scraping without actually doing the web scraping.

------
montroser
Nice of you to ask this question and to think about how to be as considerate
as you can.

Some other thoughts:

\- Find the most minimal, least expensive (for you and them both) way to get
the data you're looking for. Sometimes you can iterate through search results
pages and get all you need from there in bulk, rather than iterating through
detail pages one at a time.

\- Even if they don't have an official/documented API, they may very likely
have internal JSON routes, or RSS feeds that you can consume directly, which
may be easier for them to accommodate.

\- Pay attention to response times. If you get your results back in 50ms, it
probably was trivially easy for them and you can request a bunch without
troubling them too much. On the other hand, if responses are taking 5s to come
back, then be gentle. If you are using internal undocumented APIs you may find
that you get faster/cheaper cached results if you stick to the same sets of
parameters as the site is using on its own (e.g., when the site's front end
makes AJAX calls).
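
To make that concrete, a small illustrative sketch (the endpoint is
hypothetical; the idea is to consume the same JSON route the front end calls,
and to pace yourself based on the observed response time):

```python
import time
import requests

# Hypothetical internal JSON route -- the kind of undocumented endpoint a
# site's own front end calls via AJAX; find the real one in your browser's
# network tab.
url = "https://example.com/api/search?q=widgets&page=1"

resp = requests.get(url, timeout=30)
data = resp.json()  # usually far cleaner than parsing the rendered HTML

# Be gentler when responses are slow: scale the pause to the server's
# observed latency, with a floor of one second.
pause = max(1.0, resp.elapsed.total_seconds() * 2)
time.sleep(pause)
```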

~~~
aspyct
That's great advice! Especially the one about response times. I didn't think
of that, and will integrate it in my sleep timer :)

------
mapgrep
I always add an “Accept-Encoding” header to my request to indicate I will
accept a gzip response (or deflate if available). Your http library (in
whatever language your bot is in) probably supports this with a near trivial
amount of additional code, if any. Meanwhile you are saving the target site
some bandwidth.

Look into If-Modified-Since and If-None-Match/Etag headers as well if you are
querying resources that support those headers (RSS feeds, for example,
commonly support these, and static resources). They prevent the target site
from having to send anything other than a 304, saving bandwidth and possibly
compute.
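
A sketch of both ideas with the `requests` library (the feed URL is
illustrative):

```python
import requests

url = "https://example.com/feed.xml"  # illustrative RSS feed

# Most HTTP libraries (requests included) already send
# "Accept-Encoding: gzip, deflate", but it costs nothing to be explicit.
headers = {"Accept-Encoding": "gzip, deflate"}

first = requests.get(url, headers=headers, timeout=30)
etag = first.headers.get("ETag")
last_modified = first.headers.get("Last-Modified")

# On the next poll, send the validators back; an unchanged resource
# comes back as a tiny 304 with no body.
conditional = dict(headers)
if etag:
    conditional["If-None-Match"] = etag
if last_modified:
    conditional["If-Modified-Since"] = last_modified

second = requests.get(url, headers=conditional, timeout=30)
if second.status_code == 304:
    print("Not modified; reuse the copy you already have.")
```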

~~~
Lammy
> Meanwhile you are saving the target site some bandwidth.

And costing them some CPU :) It’s probably a good idea in most cases, agreed,
but there are exceptions such as if you are requesting resources in already-
compressed formats, like most image/video codecs.

~~~
kerkeslager
Frankly, it would be difficult to find a part of your post that is correct.

1\. You're never causing their server to do anything they didn't configure
their server to do. Accept headers are merely information for the server
telling them what you can accept: what they return to you is their choice, and
they can weigh the tradeoffs themselves.

2\. The tradeoff you think is happening often isn't happening at all. In a lot
of cases they'll be serving that up from a cache of some sort, so the CPU work
was already done when someone else requested the page. CPU versus bandwidth
isn't an inherent tradeoff.

------
rectang
In addition to the steps you're already taking, and the ethical suggestions
from other commenters, I suggest that you acquaint yourself thoroughly with
intellectual property (IP) law. If you eventually decide to publish anything
based on what you learn, copyright and possibly trademark law will come into
play.

Knowing what rights you have to use material you're scraping early on could
guide you towards seeking out alternative sources in some cases, sparing you
trouble down the line.

~~~
yjftsjthsd-h
I'm curious how this would be an issue; factual information isn't
copyrightable, and most of the obvious things that I can think to do with a
scraper amount to pulling factual information in bulk. Even if it's
information like, "this is the average price for this item across 13 different
stores". (Although I'm not a lawyer and only pay attention to American law, so
take all of this with the appropriate amount of salt)

~~~
rectang
How much can you quote from a crawled document? Can you republish the entire
crawl? What can you do under "fair use" of copyrighted material and what can't
you do? Can you articulate a solid defense of your publication that it truly
contains only pure factual information? Will BigCo dislike having its name
associated with the study but can you protect yourself by limiting your
publication to "nominative use" of its trademarks? What is the practical risk
of someone raising a stink if the legality of your usage is ambiguous? Who
actually holds copyright on the crawled documents?

You have a lot of rights and you can do a lot. Understanding those rights and
where they end lets you do _more_ , and with confidence.

~~~
yjftsjthsd-h
So I think I just was being unimaginative on "scraping"; I wouldn't have
thought to save quotes/prose, just things like word counts, processed results
(sentiment analysis), pricing, etc. In which case most of that shouldn't come
up, but yes I can see where other options are less simple.

------
sairamkunala
Simple,

respect robots.txt

find your data from sitemaps, and ensure you query at a slow rate. robots.txt
can also specify a cool-off period (Crawl-delay). See
[https://en.wikipedia.org/wiki/Robots_exclusion_standard#Craw...](https://en.wikipedia.org/wiki/Robots_exclusion_standard#Crawl-delay_directive)

example:
[https://www.google.com/robots.txt](https://www.google.com/robots.txt)
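
If it helps, Python's standard-library parser can pull both hints out of
robots.txt (values are None when a site doesn't declare them; site_maps()
needs Python 3.8+):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.google.com/robots.txt")  # the example above
rp.read()

# Crawl-delay for your user agent (or the wildcard), if declared.
print("crawl delay:", rp.crawl_delay("*"))

# Sitemap URLs listed in robots.txt.
print("sitemaps:", rp.site_maps())
```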

~~~
aspyct
Yeah, that's a must-do, but I think most websites don't even bother making a
robots.txt beyond "please index us, Google". However, that wouldn't necessarily
mean they're happy about someone vacuuming up their whole website in a few days.

------
jakelazaroff
I think your main obligation is not to the entity from which you’re scraping
the data, but the people whom the data is about.

For example, the recent case between LinkedIn and hiQ centered on the latter
not respecting the former’s terms of service. But even if they had followed
that to the T, what hiQ is doing — scraping people’s profiles and snitching to
their employer when it looked like they were job hunting — is incredibly
unethical.

Invert power structures. Think about how the information you scrape could be
misused. Allow people to opt out.

~~~
monkpit
I tried to find a source to back up what you’re saying about hiQ “snitching”
to employers about employees searching for jobs, but all I can find is vague
documentation about the legal suit hiQ v. LinkedIn.

Do you have a link to an article or something?

~~~
jakelazaroff
Sure, it’s mentioned in the EFF article about the lawsuit:
[https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-l...](https://www.eff.org/deeplinks/2019/09/victory-ruling-hiq-v-linkedin-protects-scraping-public-data)

 _> HiQ Labs’ business model involves scraping publicly available LinkedIn
data to create corporate analytics tools that could determine when employees
might leave for another company, or what trainings companies should invest in
for their employees._

------
mettamage
Indirectly related: if you have some time to spare, follow Harvard's course in
ethics! [1]

Here is why: while it didn't teach me anything new (in a sense), it did give
me a vocabulary to better articulate myself. Having new words to describe
certain ideas means you have more analytical tools at your disposal. So you'll
be able to examine your own ethical stance better.

It takes some time, but instead of watching Netflix (if that's a thing you
do), watch this instead! Although, The Good Place is a pretty good Netflix
show sprinkling some basic ethics in there.

[1]
[https://www.youtube.com/watch?v=kBdfcR-8hEY](https://www.youtube.com/watch?v=kBdfcR-8hEY)

~~~
aspyct
Great recommendations, thanks!

~~~
aspyct
I must insist. This course is great! Thanks :)

~~~
mettamage
Glad you like it! When I studied CS, I was really happy I found this course as
well. Getting some ideas about what ethics is made me a bit better at
reflecting on the implications of whatever I'm creating.

------
fiddlerwoaroof
My general attitude towards web scraping is that if I, as a user, have access
to a piece of data through a web browser, the site owners have no grounds to
object to me using a different program to access the data, as long as I’m not
putting more load on their servers than a user clicking all the links would.

Obviously, there may be legal repercussions for scraping, and you should
follow such laws, but those laws seem absurd to me.

------
RuedigerVoigt
Common CMS are fairly good at caching and can handle a high load, but quite
often someone deems a badly programmed extension "mission critical". In that
case one of your requests might trigger dozens of database calls. If multiple
sites share a database backend, an accidental DOS might bring down a whole
organization.

If the bot has a distinct IP (or distinct user agent), then a good setup can
handle this situation automatically. If the crawler switches IPs to circumvent
a rate limit or for other reasons, then it often causes trouble in the form of
tickets and phone calls to the webmasters. Few care about some gigabytes of
traffic, but they do care about overtime.

Some react by blocking whole IP ranges. I have seen sites that blocked every
request from the network of Deutsche Telekom (Tier 1 / former state monopoly
in Germany) for weeks. So you might affect many others on your network.

So:

* Most of the time it does not matter whether you scrape all the information you need in minutes or overnight. For crawl jobs I try to avoid the times of day when I expect high traffic to the site. So I would not crawl restaurant sites at lunch time, but 2 a.m. local time should be fine. If the response time suddenly goes up at that hour, it can be due to a backup job. Simply wait a bit.

* The software you choose has an impact: if you use Selenium or headless Chrome, you load images and scripts. If you do not need those, analyzing the source (with, for example, Beautiful Soup) draws less of the server's resources and might be much faster.

* Keep track of your requests. A specific file might be linked from a dozen pages of the site you crawl. Download it just once. This can be tricky if a site uses A/B testing for headlines and changes the URL.

* If you provide contact information, read your emails. This sounds silly, but at my previous work we had problems with a friendly crawler with known owners. It tried to crawl our sites once a quarter and was blocked each time, because they did not react to our friendly requests to change their crawling rate.

Side note: I happen to work on a python library for a polite crawler. It is
about a week away from stable (one important bug fix and a database schema
change for a new feature). In case it is helpful:
[https://github.com/RuedigerVoigt/exoskeleton](https://github.com/RuedigerVoigt/exoskeleton)

~~~
volkansen
If you use Selenium & Chrome WebDriver you can disable loading images with:
AddUserProfilePreference("profile.default_content_setting_values.images", 2)
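
For the Python bindings, a rough equivalent (reusing the same preference key
as above) might look like:

```python
from selenium import webdriver

options = webdriver.ChromeOptions()
# Ask Chrome not to load images at all.
options.add_experimental_option(
    "prefs", {"profile.default_content_setting_values.images": 2}
)
driver = webdriver.Chrome(options=options)
driver.get("https://example.com")
```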

------
haddr
Some time ago I wrote an answer on stackoverflow:
[https://stackoverflow.com/questions/38947884/changing-proxy-...](https://stackoverflow.com/questions/38947884/changing-proxy-while-data-scraping/38985146#38985146)

Maybe that can help.

~~~
johnnylambada
You should probably just paste your answer here if it's that good.

------
tingletech
As sort of a poor man's rate limiting, I have written spiders that will sleep
after every request for the length of the previous request (sometimes the
length of the request times a sleep factor that defaults to 1). My thinking is
that if the site is under load, it will respond more slowly, and my spider
will slow down as well.
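
A minimal version of that idea, assuming the `requests` library:

```python
import time
import requests

SLEEP_FACTOR = 1.0  # defaults to 1, as described above
session = requests.Session()

def polite_get(url):
    start = time.monotonic()
    response = session.get(url, timeout=60)
    elapsed = time.monotonic() - start
    # Sleep for roughly as long as the request took (times the factor),
    # so a slow, loaded server automatically slows the spider down too.
    time.sleep(elapsed * SLEEP_FACTOR)
    return response
```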

------
coderholic
Another option is to not scrape at all, and use an existing data set. Common
Crawl is one good example, and HTTP Archive is another.

If you just want metadata from the homepage of all domains, we scrape that
every month at [https://host.io](https://host.io) and make the data available
over our API: [https://host.io/docs](https://host.io/docs)

------
xzel
This might be overboard for most projects, but here is what I recently did.
There is a website I use heavily that provides sales data for a specific type
of product. I actually e-mailed them to make sure this was allowed, because
they took down their public API a few years ago. They said yes, everything
that is on the website is fair game, and you can even do it on your main
account. It was actually a surprisingly nice response.

------
ok_coo
I work with a scientific institution and it's still amazing to me that people
don't check or ask if there are downloadable full datasets that anyone can
have for free. They just jump right in to scraping websites.

I don't know what kind of data you're looking for, but please verify that
there isn't a quicker/easier way of getting the data than scraping first.

------
tedivm
I've gone through this process twice- one about six months ago, and once just
this week.

In the first event the content wasn't clearly licensed and the site as
somewhat small, so I didn't want to break them. I emailed them and they gave
us permission but only if we only crawled one page per ten seconds. Took us a
weekend, but we got all the data and did so in a way that respected their
site.

The second one was this last week and was part of a personal project. All of
the content was over an open license (creative commons), and the site was
hosted on a platform that can take a ton of traffic. For this one I made sure
we weren't hitting it too hard (scrapy has some great autothrottle options),
but otherwise didn't worry about it too much.

Since the second project is personal, I open sourced the crawler if you're
curious:
[https://github.com/tedivm/scp_crawler](https://github.com/tedivm/scp_crawler)

------
elorant
My policy on scraping is to never use asynchronous methods. I've seen a lot of
small e-commerce sites that can't really handle the load of even a few hundred
requests per second, and the server crashes. So even if it takes me longer to
scrape a site, I prefer not to cause any real harm to them as long as I can
avoid it.

------
throwaway777555
The suggestions in the comments are excellent. One thing I would add is this:
contact the site owner in advance and ask for their permission. If they are
okay with it or if you don't hear back, credit the site in your work. Then
send the owner a message with where they can see the information being used.

Some sites will have rules or guidelines for attribution already in place. For
example, the DMOZ had a Required Attribution page to explain how to credit
them: [https://dmoz-odp.org/docs/en/license.html](https://dmoz-odp.org/docs/en/license.html).
Discogs mentions that use of their data also
falls under CC0: [https://data.discogs.com/](https://data.discogs.com/). Other
sites may have these details in their Terms of Service, About page, or
similar.

------
moooo99
The rules you named are some I personally followed. One other extremely
important thing is privacy when you want to crawl personal data, like on social
networks. I personally avoid crawling data that inexperienced users might
accidentally expose, like email addresses, phone numbers or their friends list.
A good rule of thumb for social networks, for me, has always been to only
scrape the data that is visible when my bot is not logged in (this also helps
to not break the provider's ToS).

The most elegant way would be to ask the site provider if they allow scraping
their website and which rules you should obey. I was surprised how open some
providers were, but some don't even bother replying. If they don't reply,
apply the rules you set and follow the obvious ones like not overloading their
service etc.

~~~
aspyct
I tried the elegant way before, after creating a mobile application to find
fuel pumps around the country for a specific brand. My request was greeted
with a "don't publish; we're busy making one; we'll sue you anyway". I guess
where I'm from, people don't share their data yet...

Totally agree with the point on accidental personal data, thanks for pointing
that out!

PS: they never released their app...

------
mfontani
If all scrapers did what you did, I'd curse a lot less at $work. Kudos for
that.

Re 2 and 3: do you parse/respect the "Crawl-delay" robots.txt directive, and
do you ensure that works properly across your fleet of crawlers?

~~~
the8472
In addition to crawl-delay there's also HTTP 429 and the retry-after header.

[https://tools.ietf.org/html/rfc6585#page-3](https://tools.ietf.org/html/rfc6585#page-3)

~~~
greglindahl
Sites also use 403 and 503 to send rate-limit signals, despite what the RFCs
say.
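
A small illustrative handler for those signals (treating 403/503 as possible
soft rate limits too, and honouring the seconds form of Retry-After):

```python
import time
import requests

def fetch_with_backoff(url, max_tries=5):
    backoff = 5.0  # fallback delay when no Retry-After header is present
    for _ in range(max_tries):
        resp = requests.get(url, timeout=30)
        if resp.status_code not in (403, 429, 503):
            return resp
        retry_after = resp.headers.get("Retry-After")
        # Retry-After may also be an HTTP date; this sketch only handles
        # the simple "number of seconds" form.
        wait = float(retry_after) if retry_after and retry_after.isdigit() else backoff
        time.sleep(wait)
        backoff *= 2  # exponential backoff between retries
    return resp
```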

------
tyingq
Be careful about making the data you've scraped visible to Google's search
engine scrapers.

That's often how site owners get riled up. They search for some unique phrase
on Google, and your site shows up in the search results.

~~~
MarcellusDrum
This isn't really an "ethical" practice, more like a "how to hide that you are
scraping data" practice. If you have to hide the fact that you are scraping
their data, maybe you shouldn't be doing it in the first place.

~~~
tyingq
Depends. Maybe, for example, you're doing some competitive price analysis and
never plan on exposing scraped things like product descriptions... you only
plan to use those internally to confirm you're comparing like products. But
then you expose it accidentally. Avoid that.

------
narsil
It's helpful to filter out links to large content and downloadable assets from
being traversed. For example, I assume you wouldn't care about downloading
videos, images, and other assets that would otherwise use a large amount of
data transfer and increase costs.

If the file type isn't clear, the response headers would still include the
Content-Length for non-chunked downloads, and the Content-Disposition header
may contain the file name with extension for assets meant to be downloaded
rather than displayed on a page. Response headers can be parsed prior to
downloading the entire body.
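
A sketch of that header check with `requests` (the size cut-off is arbitrary):

```python
import requests

MAX_BYTES = 5_000_000  # skip anything larger than ~5 MB

def worth_downloading(url):
    # stream=True returns as soon as the headers arrive, before the body
    # is downloaded, so we can decide whether to bother.
    with requests.get(url, stream=True, timeout=30) as resp:
        length = resp.headers.get("Content-Length")
        disposition = resp.headers.get("Content-Disposition", "")
        if length and int(length) > MAX_BYTES:
            return False  # large asset -- not worth the transfer cost
        if "attachment" in disposition.lower():
            return False  # meant to be downloaded as a file, not viewed
        return True
```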

------
JackC
In some cases, especially during development, local caching of responses can
help reduce load. You can write a little wrapper that tries to return url
contents from a local cache and then falls back to a live request.
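
A minimal sketch of such a wrapper (caching to local files, keyed by a hash of
the URL):

```python
import hashlib
import pathlib
import requests

CACHE_DIR = pathlib.Path("cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_get(url):
    # During development each page is fetched from the live site only once;
    # every later call is served from disk.
    path = CACHE_DIR / hashlib.sha256(url.encode()).hexdigest()
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = requests.get(url, timeout=30).text
    path.write_text(text, encoding="utf-8")
    return text
```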

------
philippz
As many pages are at least half-way SPAs, make sure to really understand the
website's communication with their backend. Identify API calls and try to make
API calls directly instead of downloading the full pages and extracting the
required information from HTML afterwards. If you have certain data sets from
specific API calls that almost never change, try to crawl them less regularly
and instead cache the results.

------
DoofusOfDeath
You may need to get more specific about your definition of "ethical".

For example, do you just mean "legal"? Or perhaps, consistent with current
industry norms (which probably includes things you'd consider sleazy)? Or not
doing anything that would cause offense to site owners (regardless of how
unreasonable they may seem)?

I do think it's laudable that you want to do good. Just pointing out that it's
not a simple thing.

------
danpalmer
Haven’t seen anyone mention this, but asking permission first is about the
most ethical approach. If you think sites are unlikely to give you permission,
that might be an indication that what you’re doing has limited value. Offering
to share your results with them could be a good plan.

I work for a company that does a lot of web scraping, but we have a business
contract with every company we scrape from.

------
tdy721
Schema.org is a nice resource. If you can find that metadata on a site, you
can be just a little more sure they don’t mind getting that data scraped. It’s
the instruction book for teaching Google and other crawlers extra information
and context. Your scraper would be wise to parse this extra meta information.
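
Much of that metadata is embedded as JSON-LD, which is easy to pull out
(illustrative URL; assumes `requests` and Beautiful Soup):

```python
import json
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/some-product-page", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# Schema.org data frequently lives in <script type="application/ld+json"> tags.
for tag in soup.find_all("script", type="application/ld+json"):
    try:
        data = json.loads(tag.string or "")
    except json.JSONDecodeError:
        continue
    if isinstance(data, dict):
        print(data.get("@type"), data.get("name"))
```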

------
jll29
The only sound advice one can give is that there are two elements to consider:

1) Ethics is different from law.
1.1) The ethical way: respect the robots.txt protocol.

2) Consult a lawyer.
2.1) Prior written consent, they will say, prevents you from being sued, and
not much else.

------
sudoaza
Those 3 are the main ones; sharing the data in the end could also be a way to
avoid future scrapings.

~~~
mrkramer
That's an interesting proposition. For example there is Google Dataset Search
where you can "locate online data that is freely available for use".

~~~
aspyct
Didn't know about that search engine. Thanks a lot! Actually found a few fun
datasets, made my day :)

------
imduffy15
[https://scrapinghub.com/guides/web-scraping-best-practices/](https://scrapinghub.com/guides/web-scraping-best-practices/)
may be of interest to you.

------
Someone
IMO, the best practice is “don’t”. If you think the data you’re trying to
scrape is freely available, contact the site owner, and ask them whether dumps
are available.

Certainly, if your goal is “learning in data science”, and thus not tied to a
specific subject, there are enough open datasets to work with, for example
from
[https://data.europa.eu/euodp/en/home](https://data.europa.eu/euodp/en/home)
or [https://www.data.gov/](https://www.data.gov/)

~~~
pxtail
Where does this _'best practice is "don't"'_ idea come from? I've seen it a
couple of times when the scraping topic surfaces. I think it is a kind of
hypocrisy, and actually acting against one's own good and even the good of the
internet as a whole, because it artificially limits who can do what.

Why are there entities which are allowed to scrape the web however they want
(who got into their position because of scraping the web), while the regular
Joe is discouraged from doing so?

~~~
Someone
In my book, “not best practice” doesn’t imply “never do”, but web scraping
should be your option of last resort. Doing it well takes ages, and time spent
doing it will often distract you from your goal.

As I said, in this case “learning data science” likely doesn’t require web
scraping; it just requires some suitable data set.

The OP claimed in another comment that that doesn’t exist, but (s)he doesn’t
say what data (s)he’s looking for, so that’s impossible to check.

------
adrianhel
I like this approach. Personally I wait an hour if I get an invalid response
and use timeouts of a few seconds between other requests.

------
abannin
Don't fake identity. If the site requires a login, don't fake that login. This
has legal implications.

------
avip
Contact site owner, tell them who you are and what you're doing, ask about
data dump or api.

------
brainzap
Ask for permission and use sensible timeouts/retries.

------
sys_64738
Ethical web scraping? Is that even a thing?

~~~
kordlessagain
No, it's not and discussing it like it is a thing is irrational. Ethics are
based on morals and morals are based on determining a "right" course of action
for a given act.

Just because something is legal, by absence of law, doesn't mean it's right or
fair for all cases. Just because something is illegal (copyright) doesn't mean
it's not right or fair for all cases. What if the information saved a million
lives? Would it still be ethical to claim "ownership" of that information?

What if the information caused a target audience to visualize that thing over
and over again? Is it right to allow that information out into the public at
all?

g'disable javascript in your browser'

~~~
matz1
And morals are subjective. There is no one "right" course of action.

