Ask HN: What’s the legality of web scraping?
214 points by malshe on June 23, 2019 | 113 comments
I teach machine learning applications to masters students. Many students ask me whether it’s legally OK to scrape websites without using an API and use the data for their projects for my course. I usually just direct them to use APIs with authentication or use tabular datasets on Kaggle, data.world, etc., because I’m not a lawyer and I don’t know the legality of web scraping. The most relevant article I know is from EFF (https://www.eff.org/deeplinks/2018/04/scraping-just-automated-access-and-everyone-does-it) but it’s more than a year old.

Can anyone who knows the law please guide me on this issue? Note that the concern is less about what’s ethical and more about what’s legal. This will also help me in my research because these days some reviewers are raising this concern when they see authors used web scraped data. Online there are a ton of opinion pieces but nobody is clear on the legal side of it. Mostly people oppose scraping because they think it’s unethical.

The current state of the art is hiQ v. LinkedIn:


Basically: if it's publicly visible, you can scrape it.

Caveat: the case is still making its way to the Supreme Court.

Edit: There's also Sandvig v. Sessions, which establishes that scraping publicly available data isn't a computer crime:


Edit2: Two extra common sense caveats:

- Don't hammer the site you're scraping, which is to say don't make it look like you're doing a denial of service attack.

- Don't sell or publish the data wholesale, as is -- that's basically guaranteed to attract copyright infringement lawsuits. Consume it, transform it, use it as training data, etc. instead.

As someone who used to run a heavily trafficked and heavily scraped site, some tips from an operator:

- Make sure your scraper has both a reasonable delay (one request per second or slower) and a proper backoff. If you start getting errors, back off. We never cared about scrapers until we noticed them, and we only noticed them if they hit us too hard, we told them to back off, and then they didn't.

- Look deep for an API. A ton of people would scrape reddit without realizing we had (an admittedly poorly marketed) API for doing just that.

- Respect robots.txt. That was another way to get noticed quickly -- hitting forbidden URLs. If you hit a forbidden URL too often, you'd start getting 500 errors, and if you didn't back off, you'd get banned from using the site. It was an easy way to tell if someone was not a well behaved scraper.
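The three tips above can be sketched in Python. This is a minimal, hypothetical version: the one-second base delay comes from the comment, but the backoff schedule, cap, user-agent string, and sample robots.txt are all made up for illustration.

```python
# A polite-scraper sketch: base delay, exponential backoff on errors,
# and a robots.txt check before fetching a path.
import urllib.robotparser

BASE_DELAY = 1.0  # one request per second or slower, per the tip above

def backoff_delay(base, consecutive_errors, cap=60.0):
    """Double the delay for each consecutive error, up to a cap in seconds."""
    return min(base * (2 ** consecutive_errors), cap)

def make_robot_checker(robots_txt, agent="example-scraper"):
    """Parse a robots.txt body and return a can_fetch(path) predicate."""
    rp = urllib.robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return lambda path: rp.can_fetch(agent, path)

# Hypothetical robots.txt that forbids /private/ to all agents:
can_fetch = make_robot_checker("User-agent: *\nDisallow: /private/\n")
```

The scraping loop would sleep `backoff_delay(BASE_DELAY, errors)` between requests and skip any path where `can_fetch(path)` is false.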

> we only noticed them if they hit us too hard, we told them to back off, and then they didn't

Response code 429 is your friend: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes#429

If you're writing a scraper, you should handle any HTTP error code and back off.

And if you want to get really pedantic, 429 didn't exist when we did this. It wasn't approved until April 2012 and the first patches for it didn't show up until around 2014. We could have monkey patched if we really wanted to, but we didn't really want to.

You should back off whatever the error; this is on the client to implement. HTTP libraries don't automatically make the client wait on a 429, so I don't see it doing much to get misbehaving bots to slow down.
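Since, as noted, the client has to implement the waiting itself, here is one hedged sketch: honor a 429's Retry-After header when it carries a delay in seconds, and otherwise fall back to exponential backoff. The header names are standard HTTP; the fallback schedule and cap are arbitrary choices for illustration.

```python
# How long to sleep before retrying, given a failed response.
def wait_time(status, headers, attempt, base=1.0, cap=120.0):
    """Prefer the server's Retry-After on a 429; otherwise back off
    exponentially in the attempt number, capped at `cap` seconds."""
    if status == 429 and "Retry-After" in headers:
        try:
            return float(headers["Retry-After"])  # delay-seconds form
        except ValueError:
            pass  # HTTP-date form; fall through to the default backoff
    return min(base * (2 ** attempt), cap)
```

A client would call this after every non-2xx response and `time.sleep()` the result before retrying.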

> Respect robots.txt

This is a polite thing to do, but I don't think that there is any legal precedent for it being an actual requirement. Notably, both Apple and the Wayback Machine publicly disregard robots.txt files [1]. I would be very curious to read any court ruling that determined a robots.txt file needs to be respected.

[1] - https://intoli.com/blog/analyzing-one-million-robots-txt-fil...

It depends on the intention. You should respect robots.txt for search indexing, for example, but not necessarily for something like archiving or creating alternative page layouts (e.g outline/reader view).

Wayback machine does look at robots.txt - https://help.archive.org/hc/en-us/articles/360004651732-Usin...

They look at them, but they don't follow them strictly [1]. They make judgement calls on what they should do rather than treating robots.txt files as a legal contract.

[1] - https://blog.archive.org/2017/04/17/robots-txt-meant-for-sea...

It's a pity that robots.txt doesn't let you specify what the crawler can do with the resources it's allowed to fetch. I think that if we had such a feature (or something similar, like a "License" header) standardized early enough, a few issues regarding crawling and search engines would be moot, or at least easier to solve automatically.

True but all the commercial websites would use it to ban scraping then.

If we're talking about being polite, then #4 respect the TOS. Especially requests per minute.

It’s the TOS itself that is legally tenuous, so your best bet is to ignore it completely. There’s no picking and choosing parts of it: ignore all of it or implicitly accept all of it.

”which establishes that scraping publicly available data isn't a computer crime”

…in the USA (possibly even “in DC”). Also, ”isn't a computer crime” doesn’t imply ”isn't a crime”. Copyright law likely still applies, for example.

This being the top comment, it must be noted that HiQ v. LinkedIn is very much the exception to the well-established rule.

I'm not a lawyer but I did receive a C&D from a Fortune 100 that ultimately shut my project down. I was not selling or exposing any data directly -- it was purely consumed on the back end.

I was not hammering their site, but aggregating and caching requests such that people who used my project ultimately had orders-of-magnitude lower impact than they would've had otherwise.

The data we were sampling was fundamentally non-copyrightable in the US per Feist v. Rural Telecom; just a compendium of places, dates, and times (in the EU, raw data without substantial creative components is copyrightable), but because it was on their servers, and because we had to extract it from a HTML page that constituted a creative work, the CFAA and the Copyright Act were against us.

I talked to many different lawyers, including lawyers who had successfully defended companies from scraping-related lawsuits, and they all told me, unanimously, that it was hopeless. The law and the legal precedent are 100% in favor of the site being scraped. Essentially, it may not be illegal until they tell you to stop, but after that, it's unquestionably illegal. There is no public right-of-way on the internet.

My case is by no means unusual; it happens to several small companies on a daily basis, and it's a critical component in the ability of BigTechCos to maintain their walled gardens and effectively use legal mechanisms to route around the web's inherent distributed properties. All this "decentralized internet" stuff misses the point that the decentralization is not a technical problem, but a legal and social one.

Eric Goldman's blog [0] is a great resource that has consistently followed law related to scrapers for several years. He discusses hiQ v. LinkedIn at [1].


The applicable federal statutes, which are primarily the CFAA and the Copyright Act, don't leave much wiggle room at all on this topic, and neither does the overwhelming majority of case law. Precedents established in the 80s like MAI v. Peak have been consistently misapplied to screen scraping.

There are two particular onerous prongs of the law here: first, the CFAA's "authorized access" stipulations, and second, interpretations of the Copyright Act that hold RAM copies of data are sufficiently tangible to be potentially-infringing.

The CFAA makes it both a crime and a tort to ever access a server in a manner that "exceeds authorized access" -- essentially, as soon as the company indicates that they don't want you to talk to them, if you talk to them again, you're dead meat (craigslist v. 3taps among others).

Most companies include boilerplate in their Terms of Service that says the site cannot be accessed by any automated means and generally successfully argue that you were thereby on notice regarding the extent of your authorized access as soon as you did anything that constitutes enactment of that contract, which generally means accessing anything beyond the front page of the site ("clickwrap" or "linkwrap"), and almost certainly means anything that involves logging in, submitting forms, etc.

Re: the Copyright Act -- until it's modified to clarify that RAM copies are not independent infringements and to enshrine the rights of users to extract their own copyrighted content from another's copyrighted wrapper, it's going to be a potential infringement every time your software downloads someone's page. The real-world analog of the "RAM Copy doctrine", as it's called, would be that every time your eye reflects the image of a copyrighted work into your brain, you've made a new infringing copy. When it gets to court, that's what scrapers deal with -- and they almost always lose.

On the API front you may be able to argue that a simple JSON structure isn't sufficiently creative to qualify for copyright protection, but that would be blazing a new trail (and still leaves the CFAA to worry about). In almost all cases, something as complex as the JavaScript and the HTML that you get from $ANYWEBSITE.com, just loading it on an unapproved device is probably an infringement. That each digital load/transform is a potential infringement is how you hear about millions of infringements in file sharing cases, etc., because they're claiming each time you copied that data from your hard drive into your RAM, it was a new independent infringing copy.

Seriously, sit down and read the law, and then read the dozens of cases where this has been litigated previously. HiQ v. LinkedIn is a very limited anomaly in this pantheon, still very early in the cycle, and NO ONE should be taking it as a guiding star, at least not until it hits the Supreme Court and they come down reversing all the old precedent around this.

If you are going to build a business that depends on scraping, ONLY do so with the backing of mega-well-funded VCs, etc., who are able and willing to take on the powerful lobbies, and who are funding your company at least as much for its potential to break legal precedent as for its commercial viability.

Final note: expect no help from FAANG et al on this. Without the CFAA, their walled gardens are dead in the water. It is a critical tool used by MegaCos to retain their digital monopolies. "Network effect" means something, but it's only strangling the web to death because there are $1000/hr law firms enforcing it behind the scenes. Without that, we'd have had automatic multiplexed Twitter/G+/FB streams a long time ago. They shut down aggregators because they need to control the direct interface to the user -- if they're relegated to a backend data provider by someone with a better user experience, they're very vulnerable. This realization is what motivated Craigslist's rapid reversal on scraper-friendliness, sunk 3taps, and has been the death of many potentially innovative early-stage companies.


tl;dr The long and short of it is that until Congress passes revisions to the CFAA and the Copyright Act and/or until the Supreme Court comes down with a wide-ranging ironclad reversal of the last 30 years of case law on this topic, it's going to be perilous for anyone whose business depends on scraping.

And all this is at the federal level -- many states have enacted similar statutes so they can get in on the "put hackers in jail" action, and these battles will have to be fought at the state level too.

[0] https://blog.ericgoldman.org/ [1] https://blog.ericgoldman.org/archives/2017/08/linkedin-enjoi...

> I talked to many different lawyers, including lawyers who had successfully defended companies from scraping-related lawsuits, and they all told me, unanimously, that it was hopeless.

How long ago was this? It seems like the courts have shifted their position on this over time and only very recently (as in the last year) have they started to take a more permissive stance on scraping.

The paper linked elsewhere in this thread does a great job of summarizing the trend: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3221625

My experience was mid-2015. Hopeful signs have indeed become more frequent, but law moves at an absolutely glacial pace. Things are not going to change substantially for a few more years at the bare minimum.

We're in a good spot socially right now, as the tech behemoths are no longer perceived as plucky upstarts and quirky computer whizzes, but instead as creepy 1984-ish overlords. So I think the stage is set for upheaval -- maybe even some Congressional action if someone can tie this to the "deplatforming" thing that has Republicans fired up -- but we're a ways out yet, especially if we're just going to be crossing our fingers for a favorable SCOTUS ruling.

Compare the Aereo case at [0] for what is perhaps a counter-intuitive philosophical divide: the conservative side of the Court dissented from the majority in holding that Aereo should've been in the clear.

[0] https://en.wikipedia.org/wiki/American_Broadcasting_Cos.,_In....

>we'd have automatic multiplexed Twitter/G+/FB streams a long time ago.

Perhaps aggregation apps should have the client do the scraping, rather than being entirely dependent on server side scraping?

Regardless, the state of copyright and IP law in the US is abysmal. We can't trust these companies (FAANG) to keep their own press releases online for a decade, how can we let them monopolize ideas (which they fail to fully flesh out) and content? They have been shown to be inept stewards to their own content :c

> Perhaps aggregation apps should have the client do the scraping, rather than being entirely dependent on server side scraping?

Unfortunately, this is where the RAM Copy Doctrine gets us into trouble. It is not only illegal to "exceed authorized access" to a networked computer, the precedent currently considers loading any copyrighted work into RAM potentially infringing, e.g., if the rightsholder says you're not allowed to use their copyrighted work in that way, you have to present a viable fair use defense.

afaik, no one has brought suit against things like client-side adblockers and browser extensions that modify a page, but if they did, they'd be likely to prevail under current precedent.

We really need true legal protection for users to select their own user agents and to be free to access information willfully transmitted to them in the way they like, especially in the case of something like Facebook/Twitter, where the site itself is just a wrapper around other peoples' copyrighted content.

That will only happen if someone can convince enough Congresscritters to carve out an exception in the actual law, rather than relying on long-outmoded pre-internet judicial interpretations.

Power Ventures scoped down to extract only your own data out of Facebook and they still ended up owing $3M in damages.

See Ticketmaster v RMG at https://en.wikipedia.org/wiki/Ticketmaster,_LLC_v._RMG_Techn.... , where the argument that alternative user agents should be allowed was shot down. I discussed at some length here: https://news.ycombinator.com/item?id=12352450

Thanks for your detailed comment. This is super informative

How about free API data? Many come with terms that say you may not analyze or process it further.

Thanks. I was unaware of Sandvig v. Sessions.

Does this mean Photos, and using those photos?

I've added a second edit which hopefully answers that question.

A timely reminder that the "new and improved, cool, friendly, loves open source, a different company" Microsoft is - beyond the slick rebranding PR - still quite happy to throw its massive weight around, abusing the law, intent on rewriting the fundamentals of the open internet and access to information to everyone's disadvantage but their own.

You may want to review the court decision in the LinkedIn vs hiQ case[0][1].

> It is generally impermissible to enter into a private home without permission in any circumstances. By contrast, it is presumptively not trespassing to open the unlocked door of a business during daytime hours because "the shared understanding is that shop owners are normally open to potential customers." These norms, moreover govern not only the time of entry but the manner; entering a business through the back window might be a trespass even when entering through the door is not.

[0] https://arstechnica.com/tech-policy/2017/08/court-rejects-li...

[1] https://www.documentcloud.org/documents/3932131-2017-0814-Hi...

Thanks. This is the case the EFF article I linked in the original post also refers to.

whether it’s legally OK to scrape websites without using an API

I'm not a lawyer either, but making such a frivolous distinction has always bothered me --- HTTP(S) with HTML is an API, and it's the one the web browser uses. Maybe the "official" API offers some better formatting and such, but ultimately you're just getting the same information from the same source. As long as you don't hammer the server to the point that it becomes disruptive to other users, as far as they're concerned you're just another user visiting the site.

IMHO making such a distinction is harmful because it places an artificial barrier to understanding how things actually work. I've had a coworker think that it was impossible to automate retrieving information from a (company internal) site "because it doesn't have an API". It usually takes asking them "then how did you get that information?" and a bit more discussion before they finally realise.

"If you asked a hundred people to go to different pages on a site and tell you what they found, is that legal?"

The distinction is usually based on implied consent. The general legal principle is that if you own property and you grant consent for people to use it for one purpose, they are free to use it for that purpose, but you haven't necessarily granted consent for other purposes. Offering an API is a strong indication that you actually intend to allow people to consume the data with software, because otherwise you wouldn't have bothered. Offering an HTML interface is usually an indication that you intend for people to consume the data with a web browser.

Offering an HTML interface may be an indication that you also consent to allowing machines to read the data through the HTML - that's the idea behind search engines. But that's where it gets complicated, and that's why there's all sorts of other considerations to the legal question. Things like did you include the pages in question in robots.txt, did you say anything explicitly about scrapers in the ToS, does the scraper offer a way to contact its owner about abuse, has the website actually contacted them, has an IP ban been issued, is the scraping for commercial purposes, does it compete directly with the site, does it interfere with legitimate human use, etc.

When I buy a book, the publisher has no say in whether I use it for personal enjoyment, or for class discussion, or to write a (potentially negative) review, or to feed my fireplace. They have some control over wholesale reproduction via copyright law, not arbitrary power to decide what I do with it like, say, a restaurant that says I can only use my seat at the table to eat their food for a reasonable amount of time.

Why would bytes on the wire be any different from printed words on the page here?

There's a distinction between the physical pages of the book and the text contained within the book. You own the physical pages of the book; you can use it as a paperweight, coaster, toilet paper, fireplace fuel, whatever.

You do not own the copyright on the words of the book, and in many of the cases you list, the publisher does have a say in that. If you want to put on a school play based on the book, you need to get permission from the author. (My high school put on an in-house adaptation of Out of the Dust, and we had to write Karen Hesse and get her okay to do so.) If you put the entirety of the book on your website so that readers of your negative review can refer back to it, the publisher can come after you with a cease & desist or, if you ignore it, a lawsuit. If you write fanfiction based on the characters in the book, the publisher can come after you with a C&D. If you want to make a movie based on that book, you need to buy the film rights. (There's currently an interesting situation with Game of Thrones where HBO owns the film rights to the world of Westeros, but the film rights to the characters & story of Dunk & Egg are still owned by GRRM, so if the film rights to the earlier Dunk & Egg stories were ever bought by a studio other than HBO, they would have to be scrubbed of mentions of Targaryens, the Iron Throne, King's Landing, etc.)

In the pre-Internet days, the chance of enforcement was next to nil for many of these cases, because the big studios and publishers all got licenses for any IP, while class discussions, high school plays, and hobbyists never got a wide audience for their work and so the original publisher would probably never know (unless you did something really stupid like send it to them). The Internet's blurred a lot of these boundaries.

Copyright forbids specific actions (reproduction), it doesn’t let the publisher set arbitrary terms on my consumption and use of the text.

It's more that copyright defines certain rights (hence the name) that are owned initially by the author of the work and then may be transferred or granted to other parties for compensation. The exact rights specified are defined by statute, and then case law provides specific precedent for what it means. So again, consult a lawyer.

But for a concrete example - one of the exclusive rights bestowed by copyright is the right of reproduction. (It's not the only one, BTW: performance is another one specific enumerated, as is distribution, as is creating derivative works.) What does that mean? Well, courts have ruled that if you take an exact digital copy of a work, as sold to the public, and publish it for free on a torrent site, that's infringement. They've also ruled that there are various "fair use" exceptions that give implicit rights to the general public even when a work is under copyright. If you quote a sentence from a 300-page book to support a point in an academic paper, that's not infringing.

Where's the boundary? Consult a lawyer, because there's lots of case law. I remember that when I was at Google, there was a big debate over how big the snippets (the little summaries of text on the results page) could be. 2 sentences was fine. A paragraph was dodgy. Showing the entire page was a big no-no. Showing the entire page when the user clicks on "cached" was okay when I was there (I don't remember what the justification was for that), but that option has since disappeared, so I wonder if they ran into problems. They got around it with AMP, which requires explicit opt-in from publishers and so has an explicit consent.

It's not all that different from regular property rights in that regard. You own land. What does that mean? Well, normally it means that you can build a house on it - but not if you have a conservation easement on the land, or if local zoning codes forbid the type of dwelling you want. It normally means you have the right to keep other people off your land - except that if your property completely surrounds somebody else's property and cuts them off from a public street, you're required to grant them an easement so that they can cross your land to get to their dwelling. There are other sorts of easements you can grant, too, which are all ways of either granting other people some of the rights associated with your property (but not all of them) or restricting yourself from having some of those rights.

but making such a frivolous distinction has always bothered me

Dismissing various legal and social conventions as 'frivolous distinctions' is, in the end, probably a more harmful viewpoint than the inconveniences the 'distinctions' introduce. It's also too easy to apply it in arbitrary and self-serving ways. Scraping data off some website? Frivolous distinction. Someone hoards your personal data? Venal violation of your privacy rights.

I agree. This is exactly why I made that distinction.

>IMHO making such a distinction is harmful because it places an artificial barrier to understanding how things actually work. I've had a coworker think that it was impossible to automate retrieving information from a (company internal) site "because it doesn't have an API". It usually takes asking them "then how did you get that information?" and a bit more discussion before they finally realise.

Yes! Similar pet peeve about e.g. "You can't use encryption in Gmail." No, nothing stops you from encrypting the message outside of Gmail and pasting the ciphertext in your message's body. It's that e.g. there might not be native support in the web client.

The rule of thumb seems to be:

- If the website offers the data publicly (without authentication), it's free to scrape.

- If the data isn't protected by copyright or trademark, (e.g. public data, such as an address of a house), it's free to reuse.

- If you use the data to compete with a big company, they will sue you regardless.

Court resolutions will vary with the court and judge. https://en.wikipedia.org/wiki/Web_scraping#Legal_issues

> If the data isn't protected by copyright or trademark, (e.g. public data, such as an address of a house), it's free to reuse.

At the risk of stating the obvious, most data you'll find is protected by copyright. Eg this comment is written by me so according to nearly all jurisdictions in the world, I own the copyright (unless HN has a clause that I agreed to when I signed up that I sign it away, like stack overflow has).

Most forums, blogs, essays, articles, news sites, recipes and song lyrics are covered by copyright. I'm pretty sure that a webshop's blurb about why product x is good is covered by copyright.

Copyright protection on recipes is much more limited than the other creative works you mentioned.

If you're scraping for more factual information, in some jurisdictions, such as the US, there's a good chance those aren't subject to copyright. Things like addresses, opening hours, prices, inventory (but not a description of the inventory), etc can be very useful to scrape and present in different ways.

> If the data isn't protected by copyright or trademark, (e.g. public data, such as an address of a house), it's free to reuse

Careful with this one. It's possible that it could be copyrighted in both the US and Europe (and also have some beyond copyright protection in Europe--more on the European situation later). In the US, a collection of data might count as a "compilation", defined in 17 USC 101:

> A “compilation” is a work formed by the collection and assembling of preexisting materials or of data that are selected, coordinated, or arranged in such a way that the resulting work as a whole constitutes an original work of authorship. The term “compilation” includes collective works

In the case of a compilation like a collection of house addresses, the important thing is whether the selection and arrangement of the data was sufficiently creative. The big case on this was decided by the Supreme Court in 1991. The cite is Feist Publications, Inc., v. Rural Telephone Service Co., 499 U.S. 340 (1991).

Briefly, the compilation in that case was a book of telephone listings. There was no doubt that it had taken a lot of work to produce, and up to this point copyright law followed the "sweat of the brow" doctrine, which basically means that if you put a lot of time and effort into making something in a category that can be covered by copyright, you could get copyright.

In Feist, the Court said that it is a Constitutional requirement for copyright that the work must actually be creative. It didn't take much creativity to qualify, but there had to be a spark of creativity in there. In the Feist case, they found that the telephone book in question was just an alphabetical lists of all phone users in a region, which the telephone company was required by law to make. There was no creativity in either the selection or arrangement of the data, so no compilation copyright.

Based on Feist, then, a list of all house addresses in a region, sorted by address, or owner, or something like that, would probably be up for grabs. If it is a subset of the houses, then it is possible that selection was sufficiently creative to allow copyright. Same goes for a clever arrangement or presentation of the data, although if what you are using it for doesn't copy the arrangement or presentation, the compilation copyright might not cover your copying.

BTW, in the particular case of address data, if your application doesn't actually need specific house addresses but instead just needs to know all the valid streets in a US state, and the address ranges on those streets, look at how that state handles sales tax. Sales taxes are usually based on street address, and the states make available databases that list all streets and the tax rates for each address range within the street.

If the state is one of the states that have joined the Streamlined Sales Tax arrangement, you can get their data here [1]. All the states part of the SST group (around half of the states) agreed to a common format for the data. I think most non-SST states also make the data available in a reasonable form, so the approach of using tax data to get address information works in them, too, just not as conveniently.

Most of the rest of the world also recognizes some kind of copyright on data collections, similar to the compilation copyright in the US, for data collections that are selected or arranged with sufficient creativity. This is part of the TRIPS trade agreement.

In the case of scraping for academic purposes, it might be OK even if it would otherwise be a copyright violation due to fair use. If it is a state owned school, it might not matter because of sovereign immunity which greatly limits the ability of citizens to sue a state government for violations of Federal laws.

Some places, including most of Europe, also have a sui generis database right that creates a property right separate from copyright in databases, based on the effort to put together the database (i.e., the old "sweat of the brow" theory). I'll just point to Wikipedia for those who want more on the sui generis database right [2].

Oh, I suppose if the house addresses were for houses in Europe, then besides copyright and the sui generis database right, you might also want to consider whether or not scraping and using the data might have GDPR implications for you.

[1] https://www.streamlinedsalestax.org/Shared-Pages/rate-and-bo...

[2] https://en.wikipedia.org/wiki/Sui_generis_database_right

Complicated. Ask a lawyer. It depends a lot on the specifics of what you're doing, and case-law makes a lot of very subtle distinctions based on exactly who you're scraping, what their ToS says, how they present the ToS, how much data you take, what you do with that data, is it public, is it facts & numbers vs. opinion & expression, how much you might inconvenience their other users and staff, whether you're a direct competitor of them, etc.

I suspect you'll actually get different answers depending on which lawyer you ask. If you've got deep enough pockets you can probably ensure you get the answer you want, and if you have really deep pockets you can probably ensure the court gets the answer you want. But if you're just a student who doesn't want to end up in court, there are potential minefields there.

If you're just a student doing research, the risk that you'll end up in court is near zero regardless of other details. Any company would have to argue that you are 'causing damages' in order to sue you. So your research would have to be harming their servers, siphoning away their customers, or otherwise materially harming the company.

As an alternative to manual scraping, you can use CommonCrawl[0] or other open data sets, such as those provided by AWS[1]. That should alleviate any legal concerns (I think. I'm not a lawyer, but I'm sure CommonCrawl and Amazon have lawyers), and it's considerably faster than scraping. On top of that, you don't end up placing an unnecessary load on random websites.

[0] https://commoncrawl.org/

[1] https://registry.opendata.aws/

Thanks. I did not know about CommonCrawl.

IANAL, but I have done tons of web scraping over the years.

My tips:

- Keep careful control of the rate you scrape. Every time I have ever heard of someone getting negative feedback it is because they have scraped pages at a rate that caused an impact on the website they were scraping. If you don't cause a noticeable increase in traffic/load nobody will check to see what is going on, and generally nobody has a reason to care.

- Some sites are notoriously aggressive at going after people, such as craigslist. I wouldn't try to scrape them.

- Use some kind of proxy!

- Use some kind of proxy!

Many proxies, rotated in random order, would be best.
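To make the rate-limit-and-back-off advice concrete, here is a minimal Python sketch using only the standard library (the function name, defaults, and contact address are all illustrative, not from any particular library):

```python
import time
import urllib.request
from urllib.error import URLError

def polite_get(url, delay=1.0, max_retries=5):
    """Fetch a URL no faster than one request per `delay` seconds,
    doubling the wait after each failure (exponential backoff)."""
    req = urllib.request.Request(
        url,
        # Identify yourself honestly; the contact address is a placeholder.
        headers={"User-Agent": "course-scraper (contact: you@example.edu)"},
    )
    wait = delay
    last_error = None
    for _ in range(max_retries):
        time.sleep(wait)  # never fire requests back to back
        try:
            with urllib.request.urlopen(req, timeout=30) as resp:
                return resp.read()
        except URLError as err:  # covers HTTP errors and network failures
            last_error = err
            wait *= 2  # the server is unhappy: back off before retrying
    raise last_error
```

A production scraper would also honor Retry-After headers and stop entirely on repeated 403s, but the shape is the same: sleep before every request, and slow down the moment you see errors.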

That brings up another curious question: What's the legality of posting a site to something like HN or Slashdot and effectively getting it DDoS'd...?

> posting a site to something like HN or Slashdot and effectively getting it DDoS'd

I imagine there's some reading of the CFAA that could theoretically land you in hot water for this, but this is silly.

Intent is very important. Can one sue or prosecute a popular food critic for writing something about a restaurant, causing lines so long that long-time regulars can't get a seat anymore?

On the other hand, you have things like booter services (essentially, DDoS as a service). Continuing the analogy, I imagine if you hired 100 people to physically block the entrance of a restaurant for some reason, you would be on the hook for damages in civil court and something along the lines of "disturbing the peace" in criminal court.

This is a great question, one that is very important to our business [0] which crawls many of the major social media platforms.

Andy Sellars [1] published a paper a year ago on the topic, titled "Twenty Years of Web Scraping and the Computer Fraud and Abuse Act" [2], which puts the topic in great perspective. Many of the cases are not very clear cut and swing from one direction to the other. We are currently in an upswing where courts side with the "crawlers", which may change in a couple of years.

[0] https://pex.com

[1] https://twitter.com/andy_sellars

[2] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3221625

Teach your students to ensure there’s a delay between requests so they aren’t hammering anyone’s server, and follow the rules in the robots.txt. I’ve scraped more than a billion pages without any issues.
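For the robots.txt part, Python's standard library already includes a parser, so students don't need to hand-roll the rules. A small sketch (the `allowed_paths` helper is my own, for illustration):

```python
from urllib import robotparser

def allowed_paths(robots_txt_lines, paths, user_agent="*"):
    """Return the subset of candidate paths the given user agent
    may fetch, according to the supplied robots.txt lines."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt_lines)  # parse rules already fetched as text
    return [p for p in paths if rp.can_fetch(user_agent, p)]

rules = ["User-agent: *", "Disallow: /private/"]
print(allowed_paths(rules, ["/public/page", "/private/page"]))
# -> ['/public/page']
```

In a real scraper you would fetch `/robots.txt` once per host (via `RobotFileParser.set_url` and `read()`) and check `can_fetch` before every request.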

He asked about legality, not technical difficulty.

Actually, many of the students are technically competent to do the scraping, mostly using Python, and I am pretty sure they learned not to overwhelm web servers.

Just because it’s technically feasible does not mean it’s legal or ethical.

I'd say the act of web scraping alone is almost never unethical if you are careful not to cause undue load to servers. From an ethics, not legal, perspective, I don't see a whole lot of difference between your computer's silicon eyes and your organic eyes just looking at something that's already in plain view.

It might be illegal in some jurisdiction; IANAL but I think you can just get out of that jurisdiction and scrape away if that is the case. It might violate some ToS but ToS isn't law; the consequences of violating a ToS are usually on the order of getting your IP banned.

What you do with the stuff you scraped can be ethical or unethical.

What makes it unethical?

Why should I be treated differently than search engine spiders?

If somebody doesn’t want their site scraped then they can let people know with robots.txt. Get off your high horse.

They never said it was unethical.

Likewise, just because it's not legal, or from some perspective it's unethical, doesn't mean one should not do it.

I teach software development in a data science master's degree. We learn about web scraping; it is an important skill for a future data scientist IMHO, as the web is the largest and most important dataset in the world.

Google is massively scraping the web and is building products on top of the data, e.g. flight/hotel search. Why shouldn't we be allowed to do the same?

As others pointed out, one should take care about the ToS.

> Google is massively scraping the web and is building products on top of the data, e.g. flight/hotel search. Why shouldn't we be allowed to do the same?

The short answer is because they have the muscle and you don't.

This was litigated in Perfect 10 v. Amazon, and the only way out for the panel deciding the case was to claim Google's use is "fair" due to its unprecedented and "transformative" nature, which basically should be read as "we don't want to face the public scorn of being the judges responsible for shutting down Google Images". Such an advantage is unlikely to be a factor in less-prominent cases.

Even if you believe you can convince a panel of judges that your project specifically meets the four-prong test for fair use, it takes millions of dollars to litigate a case that far, which is well outside the realm of possibility for most independent projects.

Flight and hotel search is a good example. Why can't you find Southwest fares on any aggregator? People try to scrape them all the time, and as soon as they come to Southwest's attention, they get C&D'd and shut down. This is a common and well-established practice and hundreds of companies die by it every year.

Ok. As I am not a lawyer, I cannot say anything about the US. In Germany there are many meta-search or aggregator services for various things. All of them use web scraping extensively. E.g. a popular price comparison site: https://geizhals.de/.

The argument that the "transformative" nature of something makes it exceptional and above the law does not sound intuitive to me.

Google's flight and hotel search doesn't come from scraping the web; it comes from ITA, the Google subsidiary that operates a Global Distribution System.

Building flight and hotel search by scraping is possible, but it will make whoever you scrape very angry very quickly, because they're paying their technology vendors per search, so they watch their look-to-book ratios very closely.

So where does ITA get the data from? German competitors like swoodoo get the data 100% from web scraping.

ITA is in the same business as SABRE, providing inventory management for airlines and some hotel chains. Many airlines' own sites will run searches with ITA or similar.

Thanks. Do you have any set guidelines for students when they should not scrape a website?

> the concern is less about what’s ethical and more about what’s legal.

Please reconsider this position. You're teaching the future generation of engineers and scientists. Even if it's not strictly the topic of your course, please don't teach your students that everything that's technically legal to do is fine. Show that being socially conscious matters as well. Everybody will be better off.

I think I didn't frame my statement well. Although I don't make absolute claims about ethics, I tell the students that some practices are considered unethical by some people for XYZ reasons. I leave it up to the students to make up their minds, because all of them are adults and many of them are actually much older than me. I have lived and worked in several countries around the world, which has taught me that talking about ethics in an absolutist fashion is a terrible idea. Once I was teaching a bunch of international students predictive models to screen job candidates when the candidate pool is too enormous to tackle manually. Some students felt that it was unethical to use algorithms to decide a human being's job prospects. I know it may sound weird to many folks on HN, but people really have strong feelings about these issues on ethical grounds, and it varies a lot from person to person and culture to culture.

Equally importantly, not everything that is illegal is immoral.

I completely disagree on 2 levels:

First, teachers shouldn't be teaching morals, especially in college and university. The slippery slope from morals to politics is a dangerous one. I'd rather they focus on their actual course materials.

Finally, there is nothing wrong with scraping from an ethical standpoint if you don't DDoS the target services. It gave us search engines, and that's probably one of the most important breakthroughs for humanity in the past few decades.

The question obviously came up during their teaching, so it's become part of the course, whether they want it or not. OP also says that their peers think there are ethical questions in regard to scraping.

I don't see where you see politics in how they handle such questions. I'm not advocating they go on an extended lecture about their personal views on the political system that made the laws and what not.

I'm saying that there's a difference between handling these kind of questions with "if you're not sure, maybe you should kindly ask the publisher of the data if they would be ok with you scraping/using it that way" and "if your lawyer says you're in the clear, fuck them and scrape away."

The OP asked explicitly about legality, not ethics. So no, ethics didn't come up organically.

> First, teachers shouldn't be teaching morals, especially in college and university. The slippery slope from morals to politics is a dangerous one. I'd rather they focus on their actual course materials

I disagree 100%. "To teach about the human anatomy, we've kidnapped Paul here, and will now cut him apart."

It's great for teaching (how better to observe what happens when you cut open a living person than ... cutting open a living person), but it's unethical (and illegal), and that's an important lesson as well.

I agree completely. Unfortunately, not only are teachers currently teaching morals, those in the social sciences teach a radical political position, as well as advocate heavily for activism. It's completely inappropriate, and I can't believe that the institutions they're employed by seem to be complacent at best, responsible at worst.

Hmmm. I wonder if you've made a false dichotomy: relevant moral issues vs. course materials.

That said, I appreciate the distinction between morals and ethics. I understand ethics in the domain of 'what is good' and morals in the domain of 'what is good in society'. Fair use suggests it's ethical to use published materials for educational purposes, whereas morals ask if what you're actually doing with the data is good for society.

Although I agree with your point about politics, in the case of ethics I don't take an extreme stance. I certainly discuss the ethical issues surrounding certain practices, but I refrain from preaching to them about what they should be doing. People should be aware of the ethical concerns of others, in particular when the issues are fast evolving.

It may depend on how you use the data — for instance, publicly sharing what you scraped is a clear copyright violation in many cases.

And in some cases scraping is a violation of ToS. (Though who knows whether that’s ever been litigated as enforceable.)

ToS presumably only apply to those who have seen them. So if you want to put data behind a ToS, you need to show it and be able to demonstrate that the user has seen it, such as by having users log in and accept the ToS at registration.

It would be extremely surprising to learn otherwise, for example that there is a jurisdiction in which site users are bound by terms they can only find by actively looking for them on the site.

Thanks for pointing out the ToS. Does the ToS apply even when someone is not logged in to an account?

It depends, but in general, yes, it can be made to apply with a small amount of well-placed boilerplate language. It's called either "clickwrap" or "linkwrap" depending on the way it's presented.

See Nguyen v. Barnes and Noble at https://en.wikipedia.org/wiki/Nguyen_v._Barnes_%26_Noble,_In.... for a recent example that represented a loosening of precedent by ruling that the ToS was not enforceable because the user did not receive adequate notice. If B&N had placed their disclaimer in a place where the user was more likely to see it, they would've been fine.

Ask a lawyer? Many are written as if they do; how enforceable that is is beyond my knowledge.

Web crawling by search engines shouldn't be far from web scraping in terms of data collection. I am wondering what is the legal boundary of web crawling for search engines? While web scraping sounds sneaky, why isn't web crawling?

You willingly submit links to a service to crawl your site, there's nothing like "consent" for scraping...

You don't, actually, most sites are discovered organically through links on other sites. Submitting links hasn't been common since the days of Yahoo and DMOZ.

You're right that "consent" is the important legal issue, but it's usually implied based on what your site requires re: authentication/authorization, robots.txt, and the controls Google has provided to let you tell them not to index a site.

Nope - it just requires someone to link to you. Since there are informational sites that list new domains, that might happen automatically.

Respecting robots.txt and using a publicly declared UA come to mind.

I've raised this same point on a bunch of threads about scraping in the past, but...

Scraping is fine if you ask the company and get permission!

This may seem obvious, but so many conversations about scraping seem to start from the position that it is in some fundamental way, not allowed. This is not true.

Conversations also seem to start from the assumption that you need to scrape the whole web, which again is not true.

If you're teaching a machine learning course, perhaps you have a project on classifying... cars. Do you need to scrape the whole web to get a bunch of data about cars? No. Could you get away with scraping just Autotrader or a similar site? Maybe! Why not ask them! If you clearly state that it's for learning, that credit will be given, etc, you may find them quite amenable to it.

I work at a company built significantly around web scraping, and we have contracts with all of our scrape targets that confirm we are allowed to scrape them.

Biggest scraper in the world? Google. Do they obey robots.txt? Not a chance; they really don't care. So do what Google does, which is basically to run it like they own the world, and guess what? It's actually legal.

The funny thing is that it is legal for them, and the same rules do not apply to tiny plankton. So when it comes to scraping there is no choice but to move in the gray area and hope that you don't get caught (or that they will notice too late, and your product will be noticeable and large enough that it will be allowed to join the big-fish club).

What jurisdiction are we talking about? Laws aren't the same everywhere.

We are based in the US

is it illegal anywhere?

Yes, in Sweden it is.

Let me put them on my blacklist...

I work in Fintech. One of our products is "alternative data"... Where we sell financial datasets to other financial institutions, mainly hedge funds.

Typically the client will ask you to fill out a questionnaire about how you create or generate the data. There are lots of questions about web scraping.

The general sense is that these firms are more and more sensitive to purchasing data that has been scraped... Especially if it relates to individuals or social media.

This is really interesting!

I’ve seen projects where the company outsourced the scraping to contractors with just vague instructions to “source the data.” That way the company is insulated somewhat. Not entirely. And the ethical issue is still there. It’s not an answer to your question, but this does tell you what some companies do in practice to sidestep the issue. Might work for some research projects too.

There are two specific concerns:

1) Copyright
2) Terms of service

If doing hobby/education projects and not publishing what you create, copyright isn't really relevant.

As for violating terms of service (which is very likely), that's not "illegal", it just opens you up to being sued. Which is very unlikely, if no one is making money out of it, or hurting the service itself.

Hi all, I read all your comments [around 10 pm US ET]. Thanks a lot for taking the time to share your knowledge and thoughts! I have a much better idea about the legalities now. This will help me immensely in my teaching as well as my research. In fact, it will also help my students in their current and future jobs.

Disclaimers: IANAL. And I run https://serpapi.com. I can give you free credits to your students for ML uses if you want.

Legality highly depends on where you are.

In the US, scraping of public data is a fair use exception protected by the First. If you have to sign in to access the data, you might then be bound by the ToS.

In Europe, scraping of public data can be against several laws, notably the GDPR and the new copyright law, and you might be infringing copyright on databases as defined by the CNIL.

Hey just registered, really neat. You don't support 'popular times' results for local places. This may not even exist in your area so you may not be aware of it.

I've always really wanted to make a terminal app to keep track of how busy my local places are. Not saying I'd become a customer who would keep the lights on or anything like that, but at the very least it would make a cool demo.

Thanks a lot! We are based in the US. I will get in touch with you.

His advice is incorrect. Scraping of public data is by no stretch of the imagination "protected by the First".

Thanks. I was unaware of this case.

The CFAA is arbitrarily enforced and it is impossible to know if you are safe legally. People in this thread are saying that publicly accessible data is safe to scrape but that certainly wasn't the case in United States v. Andrew Auernheimer.

Sandvig v. Sessions is more recent and says otherwise for publicly available information:


I think this is the most important sentence from that article:

[The CFAA] does not make it a crime to access information in a manner that the website doesn't like if you are otherwise entitled to access that same information.

I think the legal developments around this topic have been fluid, which has made it difficult to keep track of the current state of the matter. I will check out the case you mention.

Several comments suggest you ask a lawyer. AFAIK, a lawyer can't answer the question without doing the work, just as a doctor can't tell if you're sick without an examination.

I know this isn't the answer you are seeking, but it might help you find more examples: the area of copyright and fair use has a longer history with digital images. Here's an example court case showing that, as others have noted, the ruling judge has a great impact on the outcome: "Court Rules Images That Are Found and Used From the Internet Are 'Fair Use'" by Jack Alexander, 2018-07-02 [1]

Maybe your educational institution has already done some legal work related to issues of copyright and educational use?

Here is an example from a university where they have done the legal work and constructed further guidelines to determine safe harbor guidelines.

“The use of copyright protected images in student assignments and presentations for university courses is covered by Copyright Act exceptions for fair dealing and educational institution users. [...] In certain circumstances you may be able to use more than a "short excerpt" (e.g. 10%) of a work under fair dealing. SFU's Fair Dealing Policy sets out "safe harbour" limits for working under fair dealing at SFU, but the Copyright Act does not impose specific limits.”

[1]: https://fstoppers.com/business/court-rules-images-are-found-...

[2]: "I want to use another person's images and materials in my assignment or class presentation. What am I able to do under copyright?" (https://www.lib.sfu.ca/help/academic-integrity/copyright/stu...)

Thanks a lot for sharing these links. I will follow your advice and talk to the university's legal folks about this.

If your university is not already working on this, it seems you're ready to raise the issue. Good luck. Please keep HN updated.

I'm puzzled by why anyone would think it's not OK to scrape. What you do with the data is what matters; as long as it's transformed, you're in the clear.

Scraping is the entire business model of every search engine.

What's public, can be scraped.

The same applies to radio streams or Netflix videos: once they're streamed, you can legally record the stream for yourself.

You can't copyright facts.

As journalists, we scrape things to collect information used toward transformative analysis. Not straight-up mirroring. Facts, as stated by an entity. So we've never run into a legal issue doing this as long as we used the scrapes to synthesize results into data. For example, map of restaurant closures by the health department, with statistics and graphs of violation frequencies. Or analysis of lawyer performance by cross-referencing a state judiciary database search with their team member lists for success rates and other stuff.

Most of the sites we scraped were county, state or federal government sites and they contained information available in the interest of the general public. However, we crawled tons of private sites as well and as long as we wore white hats we considered it fair game.

We typically tried to scrape things fairly without causing technical issues, but to be honest, we ignored robots.txt directives all the time; we just timed our scraping to happen during off hours, with backoff mechanisms in case we contributed negatively to computational loads. The typical issues we ran into were overeager system administrators who squashed or interfered with our scraping attempts under their personal interpretation of appropriateness. Sometimes they sicced misguided lawyers on us. Most of the lawyers couldn't tell you what you were doing wrong, let alone how a site is registered, what a glue record is, how DNS differs from IP registry ownership, or how collocated servers work and who owns them. They couldn't prove what we did with any of the data to even imply we violated any copyrights.

So we relied on our legal departments to clear the way in case of issues, but in 15 years of doing this, I've never once had a legal issue come up and put a stop to what we were doing under that operating premise of transforming the information into data. Our legal team never got involved for that sort of thing. There were issues, but they got resolved through communication or by reconfiguring our scrapers. Even when we've also made the raw data available to the public or other researchers, it hasn't come up as a problem.

In one case, a police department figure blocked us because they disliked our coverage. Their pretense was that our geocoding wasn't accurate enough from the information they provided, and rather than circumvent their blocking, we had face-to-face meetings to address and mollify those concerns, on the record. They ended up providing us with additional information to meet that accuracy. In another case, the CEO of a large private company personally threatened us legally, claiming we violated the terms of service for their API endpoint. However, their terms of service mentioned nothing about data retention once something became a data point, and we felt we were in the clear, so we kept doing it for years and nothing came of it.
