Web scraping is everywhere, even if it isn't openly discussed or acknowledged. The publicized perception of web scraping is fairly negative, but doesn't take into account the benefits of data used in machine learning or democratized data extraction (as in the case of this article or for building public service apps like transportation notifications), or the simple realities of competitive pricing and monitoring the activities of resellers.
Researchers, academics, data scientists, marketers: the list of people who use web scraping daily goes on.
Glad you enjoyed the article! I'm hoping that more examples of ethical data extraction will start to turn the tide of public perception.
I completely accept how important scraping is as a data source, but that doesn't make it any more legal. It's in a space right now where only big companies can take unmitigated advantage of the tool, because it'd cost millions of dollars to successfully defend a CFAA suit.
I work for Scrapinghub as well and try to understand the law around this. I can help with some pointers to why I think some web scraping isn't illegal… there are of course some limits to this.
When the data scraped is "publicly available on the Internet, without requiring any login, password, or other individualized grant of access", the U.S. District Court for the Eastern District of Virginia in Cvent vs Eventbrite (https://casetext.com/case/cvent-inc-v-eventbrite) ruled one could not be deemed to be exceeding authorized access.
There are two ways, that I know of, that courts have ruled you can exceed your authorization:
- When the site owner has contacted you and removed your authorization in writing, as happened in Craigslist vs 3taps.
- By accepting the terms of service and agreeing against scraping. You have to do this through a "clickwrap" ToS, rather than a "browsewrap". You can read about the differences here: https://termsfeed.com/blog/browsewrap-clickwrap/
As a matter of policy, we don't scrape any site with a ToS with clear anti-scraping language and which forces us to create an account or "constructively agree" as part of the use of the site.
Any user wishing to revoke authorization for anyone using our platform can make an abuse report on our site; we tend to handle these within 24 hours and haven't had a single claim go further than this stage, as we aim to be reasonable and look for a way to provide value to both sides.
Most websites have anti-scraping boilerplate in their ToU. I'm pretty sure that's in the "customized" ToU I got from LegalZoom. So you're basically saying that if the data is not behind a login and doesn't require you to fill any forms that contain either a checkbox or nearby language that indicates submitting the form constitutes acceptance of the ToU, you'll scrape it, even if the ToU explicitly bans scraping.
What do you plan to do when you do receive a C&D from someone that claims you've agreed to their browsewrap ToU? Are you going to argue that your use is not unauthorized since the data is public?
I guess I assume you'll comply with any C&D since you state that receipt of a written removal constitutes a revocation of authorization. However, I don't believe that's enough for some. Check out QVC v. Resultly, where QVC sued on a CFAA claim based on browse-wrap agreements (they lost; Resultly asserted permission was granted by robots.txt and the Court agreed).
Beyond just CFAA claims, there are copyright and trademark claims attached to most scraping cases. Those have unfortunately succeeded most of the time. The most egregious is Ticketmaster v RMG, where it was ruled that RMG had violated Ticketmaster's copyright by downloading the page (specifically, making a copy of the page contents in RAM and extracting only non-copyrightable content, and then discarding the complete copy; in short, downloading the page). Facebook v. Power Ventures is also pretty brutal.
I hope Scrapinghub is well-funded enough to trailblaze some space in the law here, because we really need it.
I'm not sure Scrapinghub is funded externally, as I couldn't find anything on their valuation or revenues, so I assume that they are bootstrapped.
I do not discount any of what you wrote, but a lot of it is imagined danger. Scrapinghub would be immune to such cases unless they sided with their customers like 3taps did. 3taps did not stop scraping for their client PadMapper. I don't think Scrapinghub is willing to put their neck out for someone paying $20/month to scrape craigslist. In fact, those are the shittiest segments of this market imo, the bottom feeders who have wildly unrealistic expectations of scraping as a magical black box that will solve their problems.
I wasn't doing anything egregious. The product I offered did not compete with their products; it actually made it easier for the consumer to spend money with them. The data I was gathering is mere factual data and is not subject to copyright (though, as in Ticketmaster v. RMG, this alone will not protect from copyright infringement claims). Their site is the single place that this factual data can be found.
If I were to actually dispute this company's claims and refuse to comply with their C&D, they would sue me. This would've cost me millions of dollars in legal fees before the case was through, which is irrelevant to them but obviously well outside of my reach. There's a good chance they would've gotten an injunction legally forbidding me from continuing to offer my service almost immediately, so then I'd have been stopped from offering my product AND I would've had a pending lawsuit against me, which would've asserted some absurd dollar amount of damage, and, if Facebook v Power Ventures is any indication, there would've been a good chance that I would've been held personally liable for it.
It doesn't matter that their claims are all dependent on interpretation and grey area. What matters is that if you don't have $30-40 million sitting around, you can't take the risk of a lawsuit from a big company. Gotta earmark $1-10 million for legal fees (depending on what kind of lawyers you get; the opposing party in my case has one of the most expensive law firms in the country); set aside $5-10 million in case you lose and have to pay damages; and set aside some chunk of money to continue to bear the cost of maintaining and running the business despite the legal pressure and despite the likelihood that you've been legally disallowed from selling your primary money maker pending resolution of the case, which will likely drag on for a minimum of 3-5 years, and up to 10 years is not really unheard of. Gotta have the extra $20 mil+ so that you don't pour more than 50% of your net worth into something that is very possibly a losing battle.
My lawyer advises me that the various workarounds I devised could be construed as conspiracy and aiding and abetting, even though I would no longer be making any requests to the complainant's servers at all. This also wouldn't stop the complainant from suing me for past damages or to stop the practice they dislike, even if I'm doing it through means that totally obviate the need to access any of their servers.
If I were to continue operating, the only option would be to leave the U.S. entirely for a jurisdiction that doesn't enforce U.S. judgments (since I would be sued in the US and lose by default; my lawyer indicates that merely moving my company overseas is insufficient), and not return until the statute of limitations expires on the judgment that would get registered. Even this is not foolproof because the activity would have to be obviously and unequivocally legal in the new host jurisdiction so that the company's lawsuit in that jurisdiction wouldn't get anywhere, the jurisdiction would have to decline to enforce judgments on U.S. persons, and they'd have to be impervious to attempts by one of the world's largest companies to influence their legal system. I haven't found such a jurisdiction yet. Some are kinda-sorta close (but not really).
There's a distinct line between what you do with the data you scrape vs. writing the tools and code to build a script that will get you that data.
You don't arrest the kitchen knife company's CEO because someone used it to stab someone. And the fact that the web scraping/crawler vendors have saturated the market is testament to the fact that you are overreacting.
I can imagine the fear of facing a devastating legal battle and how it might have permanently shifted your view on web scraping, but I see no valid basis for all web scraping services and vendors to shut down.
I also find it puzzling that you would act against your own interests by continuing to talk openly and in detail about a legal situation like this, because that PadMapper guy pretty much just shut up as soon as the details were involved.
But feel free to provide more details that show that you were running a web scraping service or software company.
Big companies don't send polite emails asking you to pretty please stop. They just let their lawyers deal with the whole kit and caboodle.
Scrapinghub's existence depends essentially on luck; first, that they won't get sued, and second, if they do get sued, that they'll get a sympathetic judge who will find that no contract was entered due to insufficient notice. That is not likely due to the nature of scrapinghub's operations (see Register.com v. Verio). The fact that some people are able to scrape and get away without being sued doesn't change the legal reality or the dubiousness of investing in a company with such a large risk profile.
The knife CEO analogy fails because CFAA claims are NOT about how the data is used. They are about the method used to obtain the data. The entity exceeding authorized computer access or accessing a computer without authorization -- in this case, that is scrapinghub, kimono, et al -- is the entity that has committed the violation of the CFAA. In your knife analogy, if the knife company had illegally acquired the metals used to manufacture the knife, it would be the culpable party, not the end user that bought its knives. The data that scrapinghub goes out, obtains, crafts and packages according to customer specifications ("make this page on craigslist a CSV file that auto-updates every 5 minutes") is the metal that the knife company goes out, obtains, crafts and packages according to customer specifications ("make this metal a sharp cutting utensil").
The person using the data that results from CFAA violations may be doing other illegal things, but in almost all of these types of cases, they're not violating the CFAA if they're not the ones accessing the computer that supplies the data.
I'm really not sure what you're arguing about anymore. The CFAA isn't a real law because the person gathering the data isn't necessarily the one putting it to use? I don't understand.
Got a source on that? Google scrapes all the time. This is how they index all the pages it discovers.
The only real scenario I recall is Craigslist v. 3Taps, but 3Taps just kept scraping Craigslist through multiple proxies even after Craigslist banned their IP addresses.
Then there are airplane ticket websites scraping each other and getting into hot water.
Having said that, it's not as clear-cut as you'd like to put it. The CFAA ruling only happened because Craigslist felt directly threatened by PadMapper, which relied on 3Taps.
This really is a shitty shitty business model. All that work 3taps did for you guys and they take all the heat? I don't know why 3taps didn't just comply, was PadMapper 100% of their business?
I definitely think that at the outset, it looks weird that 3Taps ended up taking PadMapper's heat, but I think that 3Taps wanted to become a generalized thing-as-a-service vendor. It's possible that PadMapper wasn't 3Taps's only customer for the CL feeds. As PadMapper wasn't contacting CL's computers without authorization, it makes sense that CL had to change the target to 3Taps. At that point, PadMapper would've seen that scraping CL meant a near-impossible legal challenge for a startup and been wise enough not to implement their own solution.
This is all just speculation, but I doubt that 3Taps stuck its neck out for the sole benefit of PadMapper.
I like this feature because it impedes the rapid nesting of conversations, and also allows the author time to edit his reply before anyone can address it.
The source is the CFAA, which makes it a crime and/or a tort to commit any "unauthorized" access to a computer system. Because authorization is not defined in the statute, it's a matter of interpretation whether or not one's use is unauthorized. Historically, judges have strongly disfavored scrapers.
Because there is a lot of grey area around what may or may not constitute "unauthorized" access to a computer system, if a company does bring a tort claim against you for accessing their system without authorization, you might actually win -- if you can afford the time and money to fight them for the minimum 3-5 years it'll take your case to resolve. This is hundreds of thousands in legal fees easy.
3taps eventually had no choice but to give up because they couldn't take the legal costs anymore, and Power Ventures tried to stick it out and ended up not only being held liable for $3 million in damages to Facebook's systems when no actual damage had occurred at all, but the veil was pierced and the founder held personally liable. It's obvious from the court documents that he was struggling to afford counsel, and companies must be represented by an attorney, so he didn't even have an option to try to represent himself.
>Google scrapes all the time. This is how they index all the pages it discovers.
Yes, Google's operations are, strictly speaking, illegal on various fronts. They depend heavily on automated access, which many sites they index explicitly forbid and thus Google is committing "unauthorized access" to these computer systems, and they also store complete copies of the site and the individual images displayed on the site, virtually all of which are protected by copyright, and all of which constitutes flagrant violation of copyright law.
If someone did bring a CFAA claim against Google for this (which no one would, because Google is one of the wealthiest companies in the world, and it'd therefore cost tens of millions to sue them), Google would likely argue that robots.txt is the only authorization it is obligated to observe, which may or may not be an effective argument. Google also makes no guarantees about the extent to which it obeys robots.txt; it's a way to signal your desires to Google, which it may or may not honor.
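For anyone who hasn't looked at what "observing robots.txt" involves mechanically, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, and the example.com URLs are all made up for illustration, and the file is parsed from a string rather than fetched. This only shows the mechanical check; it says nothing about whether passing it counts as authorization under the CFAA.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for illustration);
# a real crawler would download this from the target site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.com/listings",
                "https://example.com/private/data"):
        print(url, is_allowed(ROBOTS_TXT, "example-bot", url))
```

A crawler that wants to honor the file would run a check like this before each request and skip any URL for which it returns False.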
tl;dr The very short answer to all of this is that traditionally, the legal system has been extremely suspicious of scrapers and has treated them very badly, applying concepts intended for the physical world like trespass to chattels to server access. This has been improving somewhat in recent years, but is still a very financially and legally precarious situation in which to find oneself. The people who get away with it get away with it because no one sued them before they were too big to sue.
It's illegal to break the CFAA whether the plaintiff specifically tells you that they think you're doing it or not. If they send a C&D, yes, you'd be wise to comply, but that's not going to absolve you from claims that you harmed their company by violating the CFAA before they sent it (which do happen and are usually claiming a pretty ridiculously silly amount of damages for something as innocent as downloading a web page from their server). You'd have to argue in court that your access was authorized and they'd have to argue that your access wasn't authorized. The judge and/or jury would then evaluate.
I don't think so. The CFAA states:
>Whoever intentionally accesses a computer without authorization or exceeds authorized access, and thereby obtains information from any protected computer shall be punished as provided in subsection (c) of this section. (a)(2)(C)
It defines a "protected computer" as:
>...the term "protected computer" means a computer which is used in or affecting interstate or foreign commerce or communication, including a computer located outside the United States that is used in a manner that affects interstate or foreign commerce or communication of the United States; (e)(2)(B)
As the Supreme Court has ruled that virtually anything in the United States is subject to the Commerce Clause, this comprises practically all computers, especially after you consider that usage of a computer network almost certainly takes your traffic out of state. Many states have corollary laws to the CFAA with substantially similar language, so if you can miraculously convince a judge that the computers involved are not part of interstate commerce and that the feds therefore have no jurisdiction, there's a good chance you'll have to contend against a similarly-worded state statute.
I don't see any limitations or exceptions here. If you are accessing a computer in an "unauthorized" manner and obtain information whilst doing so, you have violated the CFAA.
The reason scraping can happen is a combination of lack of technical awareness (both from lawyers about computers and from programmers about law) and the cost of pursuing a lawsuit. Even if you break the law, someone has to take issue with your law-breaking before anything happens; they have to file either a lawsuit or an indictment to get the ball rolling. That some people are able to get away with violating the CFAA without someone registering a formal complaint on the matter has nothing to do with whether or not one has violated the statute.
The only way that scrapers don't violate the CFAA is a liberal interpretation of the term "unauthorized", wherein a judge states that if a computer is advertising and allowing public access, then all members of the public are inherently authorized to access it. I know that several scrapers have taken their cases through the courts hoping that such an interpretation would be given.