If the app features certificate pinning to block MITM eavesdropping through your own proxy, you either use one of the Xposed Framework libraries that removes it on the fly in a process hook, or you decompile the app and return-void the GetTrustedClient, GetTrustedServer, AcceptedIssuers, etc. functions.
If it features HMAC signing, you decompile the app, find the key, reverse engineer the algorithm that chooses and sorts the parameters for the HMAC function, and rewrite it outside the app. If the key is generated dynamically you reverse engineer that too, and if it's retrieved from native .so files you're going to have a fun time, but it's still quite doable.
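To make the "rewrite it outside the app" step concrete, here is a minimal sketch. Everything specific in it is an assumption for illustration: the key value, the choice of HMAC-SHA256, the sort-by-name canonical form, and the `sig` parameter name. A real app will differ on some or all of these, which is exactly what the decompiling is for.

```python
import hashlib
import hmac
from urllib.parse import urlencode

# Hypothetical key: stands in for whatever you recovered from the decompiled app.
SECRET_KEY = b"example-key-recovered-from-apk"


def sign_params(params: dict, key: bytes = SECRET_KEY) -> str:
    """Return a hex HMAC-SHA256 over the parameters sorted by name."""
    # Assumed canonical form: name=value pairs, sorted by name, URL-encoded.
    canonical = urlencode(sorted(params.items()))
    return hmac.new(key, canonical.encode("utf-8"), hashlib.sha256).hexdigest()


def signed_query(params: dict, key: bytes = SECRET_KEY) -> str:
    """Build a query string with the signature appended as `sig` (assumed name)."""
    return urlencode(sorted(params.items())) + "&sig=" + sign_params(params, key)
```

Because the parameters are sorted before signing, `sign_params({"b": "2", "a": "1"})` and `sign_params({"a": "1", "b": "2"})` give the same signature, which is the property the server side has to rely on too.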
All they can do is pile on layers and layers of abstraction to make it painful. They can't make the private API truly private if it requires something shipped with the client.
If they keep changing it up, I'm sure you could automate the decompiling process. The reality is that this technique is security by obscurity at its core, and is therefore never going to succeed.
(This technique is useful not just for scraping data, but for UI testing of your own apps.)
This is totally true, but the original premise was to do it just with a MITM. I was being generous and assuming most apps do dynamic generation of their keys. I'm probably wrong now that I think about it.
If the data owner went through the trouble of encrypting the traffic between it and its app they have a certain expectation of private communications that you'd better have a damn good reason for violating.
Once the api is reverse engineered, we might be tempted to improve the usability of the application by adding some features (scraping all data). If this hurts the server (huge resource consumption), this becomes unethical and may become illegal.
Reverse engineering occupies a much more ethically and legally grey area than outright hacking because you are fundamentally taking software in your possession and modifying it. There are strong arguments that people should have the right to do this. It can lead to hacking, and it's useful for security research, but it is not in and of itself an attack on the application's security (you could make a case that it is an attack on the application's trust model, however).
Now, if the developers relied on the privacy of the API as a form of implicit authorization (i.e. by forging requests from the client I can retrieve another user's data using an insecure direct object reference on a username parameter), and I proceed to do that - yes, that's hacking. You're accessing confidential data in an unauthorized manner, just as you would be if an insecure direct object reference were present on the website. The developers made a mistake in conflating client validation with user authorization, but you've still passed a boundary there.
It is arguable that this is unethical or at least amoral, but if all you're doing is scraping publicly available data using the public mobile API, it is at least legally defensible until the other party sends you a C&D for using their API in an unauthorized manner (so long as you haven't agreed to a TOS by using the mobile API, which really depends on whether and how prominently they have a browserwrap in place). I think the spirit of your point is that someone probably just shouldn't be using an API if they're not authorized to do so, but it's a very important legal and technical distinction to make here that you aren't hacking by reversing the embedded HMAC process.
I recommend you read CTF writeups (there was one hosted on GitHub where a team retrieved the request signing key for Instagram IIRC). Those are usually very tutorial-like, though they tend to take some level of knowledge for granted even if they don't intend to.
The other thing to do is pick up apktool, JD-GUI, dex2jar and maybe IDA Pro, Hopper or JEB and learn them as you go.
Do note that if you become annoying and/or conspicuous enough, they can use legal force to stop you, and if the case actually goes through the process, you'll almost surely lose. This is true at least in the United States and Europe.
I suppose I can envisage a scenario where the app sends a counter as a request id, and my script will then send the next value of the counter as its id, causing the next request from the app to re-use a counter value and thus fail, but the server API and the app have no way of knowing this is due to my script and not, say, network issues, and therefore it should still not affect my reverse engineering abilities...
Maybe, taking this further, the app could have baked in a certain number of unique, opaque 'request tokens' that are valid for one API call only, and when my script has used all of them it will cease to work, and in doing so cause my copy of the app to become useless, but again, not an insurmountable barrier.
Some apps use private keys baked into the app, but you can usually recover those from memory. Do this using lldb and a remote debug server on a jailbroken iPhone over USB.
Or you can treat the app as a black box and use a phone (or multiple) as a "signing server" to sign API requests before sending them via, e.g. curl or python. To do this you install the app on a jailbroken phone, tweak it to expose an http server on the phone that listens for API requests, feeds them to the "sign API client request" method, and then returns the signed request. This method has the benefit of being resilient to frequent app updates.
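A rough sketch of the "signing server" shape, runnable on a desktop so the moving parts are visible. On a real phone the `sign_fn` callback would be the hooked "sign API client request" method inside the app; here it is a stand-in you pass in yourself, and the JSON request/response layout is my own assumption, not anything a particular app uses.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


def make_handler(sign_fn):
    """Build a handler class that signs each POSTed JSON body with sign_fn."""

    class SigningHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers.get("Content-Length", 0))
            payload = json.loads(self.rfile.read(length) or b"{}")
            # sign_fn is hypothetical: it stands in for the app's own signer.
            body = json.dumps(
                {"request": payload, "signature": sign_fn(payload)}
            ).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, *args):
            pass  # keep the sketch quiet

    return SigningHandler


def start_signing_server(sign_fn, host="127.0.0.1", port=0):
    """Start the server in a background thread; port 0 picks a free port."""
    server = HTTPServer((host, port), make_handler(sign_fn))
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def request_signature(base_url, payload):
    """Client side: ask the signing server to sign one API request."""
    req = Request(
        base_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    opener = build_opener(ProxyHandler({}))  # talk to the phone directly, no proxy
    with opener.open(req, timeout=5) as resp:
        return json.loads(resp.read())
```

Your scraper then calls `request_signature(...)` for each request and forwards the signed result to the real API. Since the app itself does the signing, an app update that changes the algorithm only requires reinstalling the app, not redoing the reverse engineering.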
As far as I know, only Japan has laws that allow scraping without explicit permission.
Are you talking about Aaron Swartz? There were complicating factors. Typically, no, scraping is not a criminal matter. If you need a lawyer, however, you should talk to one.
> Automated scraping is unlikely to qualify as fair use in court.
Scraping is not [usually] the problem. It is what you do with the scraped content that is the problem. Additionally, the nature of the scraped content itself is at issue: facts are not copyrightable, for instance, just particular expressions.
Fair use is a hugely, hugely complex topic. If you ever ask someone if something is fair use and they give you a straight "yes" or "no" answer - don't trust it unless they are also showing you a court decision that covers your exact scenario and is valid in your jurisdiction. Any good answer will include a healthy dose of "it depends."
The only way to conclusively determine if something is fair use is going through the fair use "run-time" - which is a court decision on the topic. There is no other way - none - to determine if something is fair use and claims to the contrary are false. To be clear: fair use is a judicial determination and it is highly fact specific.
Disclaimer: I am a lawyer, but I am not your lawyer. If you need one, you should get one.
That's how Google got big.
Western world allows for scraping as long as you follow robots.txt. If you don't, it's still not a jailable offence unless you DOS the system.
I once wrote a whole automated test suite for a web app based on this, but nowadays I think it's much more common.
Note, I'm not debating whether this is a good or bad thing, just that in the current environment this is surely a legally dodgy manoeuvre.
* You'll be working with them rather than against them.
* Your solution will be far more robust.
* It'll be way cheaper, supposing you account for the ongoing maintenance costs of your fragile scraper.
* You're eliminating the possibility that you'll have to deal with legal antagonism
* Good anti-scraper defenses are far more sophisticated than what the author dealt with. As a trivial example: he didn't even verify that he was following links that would be visible!
* Possible that target-site.com's owners will tell you to get lost, or they are simply unreachable.
* Your competitors will gain access to the API methods you funded, perhaps giving them insight into why that data is valuable.
Alternative better solution for small one-off data collection needs: contract a low-income person to just manually download the data you need with a normal web browser. Provide a JS bookmarklet to speed their process if the data set is a bit too big for that.
The cost of the bad scraper was pretty significant. They were hitting us as hard as they could, through TOR nodes and various cloud providers. But the bot was badly written, so it never completed its scan. It got into infinite loops, and triggered a whole lot of exceptions. It caused enough of a performance drain that it affected usability for all our customers.
We couldn't block them by IP address because (a) it was just whack-a-mole, and (b) once they started coming in through the cloud, the requests could have been from legit customers. We eventually found some patterns in the bot's behavior that allowed us to identify its requests and block it. But I'd have been willing to set up a feed for them to get the data without the collateral damage.
First, it's very hard to pull off a DDOS attack using Tor. The most you could get would be less than someone repeatedly pressing refresh every second. This is because if you hit the same domain repeatedly the network will flag and throttle you.
How bad was your server configuration that it would choke if somebody tried to scrape it? Was this running on a dreamhost $10/year server or something? That's the only way to explain its poor performance. Either that or your SQL queries are unoptimized anyways.
I'm just trying to understand. Unless this was like 10,000 scraper instances trying to scrape your website, I find it hard to believe this story.
Instead of downvoting, why don't you offer rebuttal to what I wrote and post more evidence to support your original story?
Please explain. Why do you think Tor can't provide a user with many RPS?
This is why I find OP's story hard to believe, it doesn't add up.
That said, it's not particularly effective as a brute-force DoS machine due to the limited bandwidth capacity and high pre-existing utilisation. Higher level DoS by calling heavy dynamic pages is still possible.
The parent didn't specify that the outages were during the period that the scraping was coming from Tor. It's equally possible that it only started affecting availability after they blocked Tor and switched to cloud machines.
All that said, screw people who use Tor for this kind of thing. They're ruining a critical internet service for real users.
I think you missed the second part of that sentence.
I must admit that I worked for a company that did that to scrape a well known business networking site....
I'd add to this:
Do you really need continuing access? Or just their data occasionally?
Pay them to just get a db dump in some format. For large amounts of data, creating an API then having people scan and run through it is just a massive pain. Having someone paginate through 200M records regularly is a recipe for pain, on both sides.
A supported API might take a significant amount of time to develop, and has on-going support requirements, extra machines, etc. Then you have to have all your infrastructure or long running processes to hammer it and get the data as fast as you can, with network errors and all other kinds of intermittent problems to handle.
A pg_dump > s3 dump might take an engineer an afternoon and will take minutes to download, requiring getting approval from a significantly lower level and having a much easier to estimate cost of providing.
When has that ever worked?
That said, if you're some small-time researcher who can't offer a compelling business case to make this happen, then it won't be worth their time and they're likely to show you the door. [Note: my implication here is that it's not because you're small time, but it's because by the nature of your work you're not focusing on business drivers which are meaningful to the company/org you're propositioning].
Edit: Also be warned that if you're building a successful business on scraped personal info, you're begging to be served w/ a class action lawsuit (though take that well-salted, because IANAL and all that jazz).
A giant site, who already had an API & had deliberately decided to not implement the API calls we wanted. I should add that another giant site has happily added API options for us. (For my client, really; not for me.)
An archaic site. The dev team was gone & the owners were just letting it trickle them revenue until it died -- they didn't even want to think about it anymore.
Rather than waiting potentially years for their IT team to make the required changes we can build a scraper in a matter of days.
Websites that make it a cluster &&*& to get access to the data do two things. They set up a challenge to break their idiotic `are you a bot?` check, and secondly it is trivial in most situations just to spin up a VM and run Chrome with Selenium and a Python script.
Granted, I don't use AJAX APIs or anything like that. Instead, I've found that pages where the developers natively include a JSON string alongside the data within the HTML are the easiest to parse.
Reasons why I've set up bots/scrapers
1) My local city rental market is very competitive and I hate wasting time emailing landlords who have already signed a lease.
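For the embedded-JSON case, a small sketch of what the parsing can look like. It assumes the page assigns the JSON to some variable in an inline script; the variable name (`window.__LISTING__` in the usage note) is made up for illustration and you would look up the real one in the page source.

```python
import json
import re


def extract_embedded_json(html: str, var_name: str):
    """Return the object assigned to `var_name` in an inline script, or None."""
    match = re.search(re.escape(var_name) + r"\s*=\s*", html)
    if match is None:
        return None
    try:
        # raw_decode parses exactly one JSON value starting at the given index
        # and ignores whatever follows it (semicolon, rest of script, markup).
        value, _end = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    return value
```

Called as `extract_embedded_json(page_source, "window.__LISTING__")`, this avoids writing a regex for the JSON itself, which tends to break on nested braces or on strings that contain `;` or `</`.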
2) House prices
3) Car prices
4) Stock prices
6) Water Rates
7) Health insurance comparison
8) Automated billing and payment systems.
The goal for web scrapers is to pay as little as possible for as much data as possible.
This is a good idea that has some interesting legal implications (e.g., the target site's network is never accessed by the software, so CFAA claims are likely irrelevant), but probably isn't enough to cover all the bases. I wanted to try something like this before I got C&D'd, but my lawyer informed me that doing it after the fact could potentially constitute conspiracy and cause a lot of problems.
I'm not a lawyer.
It allows you to run and debug the scraping code without running the spider, right from the CLI.
I use it extensively to test that my selectors (both CSS and XPATH) are returning the proper data on a test URL.
I'm taking on technical debt to access data I don't have programmatic access to.
CSS/Xpath are very fragile. You most likely will be changing them in the future.
I've been able to change what I need, and what it does for me are all the things I'd have to have done myself (caching, parallel requests but throttling per domain, dropping into debug mode, scheduled runs, etc).
Really, it feels more like a skeleton + lots of sensible defaults. The meat of the code will be in the parsing, and so if for some reason I really needed to move away from it then I wouldn't feel particularly tied to it.
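For anyone wondering what "throttling per domain" amounts to if you had to do it yourself, here is a minimal single-threaded sketch. It is not how scrapy implements it; the class name and the fixed `min_interval` policy are my own assumptions. The clock and sleep functions are injectable so it can be tested without actually waiting.

```python
import time
from urllib.parse import urlsplit


class DomainThrottle:
    """Enforce a minimum gap between requests to the same host."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url: str) -> float:
        """Block until the URL's host may be hit again; return seconds slept."""
        host = urlsplit(url).netloc.lower()
        now = self._clock()
        ready_at = self._next_allowed.get(host, now)
        delay = max(0.0, ready_at - now)
        if delay:
            self._sleep(delay)
        # The next slot opens min_interval after this request actually goes out.
        self._next_allowed[host] = max(now, ready_at) + self.min_interval
        return delay
```

Call `throttle.wait(url)` right before each fetch. Requests to different hosts do not delay each other, which is the whole point compared with a single global sleep.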
Whereas the parsing is less of a value add because you prefer to code the parser yourself so that you have more control?
What about changing the parser and crawler as the websites changes?
What other pain points about scrapy do you have?
When I have a scraping task, yes. It's a set of things that are required each time but also a bit fiddly to get right, and scrapy has solved them well.
> Whereas the parsing is less of a value add because you prefer to code the parser yourself so that you have more control?
These I see as fundamentally having to be things I code as they're the parts that are different each time.
> What about changing the parser and crawler as the websites changes?
Really just a cost of doing scraping. Anything fancier has generally taken more time and created more problems than just assuming an ongoing maintenance cost. The debugging shell in scrapy is very useful for this.
Last big job I did I also built a cache that you could query by time, so all versions of the page seen were stored which was very useful for debugging intermittent problems, and finding page changes. I don't know if scrapy has this in its cache, I don't think so but wouldn't conflict with it.
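A rough sketch of what such a time-queryable cache can look like, using SQLite from the standard library. The schema and method names are my own guesses at a reasonable shape, not the original implementation.

```python
import sqlite3


class VersionedPageCache:
    """Keep every fetched version of a page, queryable by time."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS pages ("
            "url TEXT NOT NULL, fetched_at REAL NOT NULL, body TEXT NOT NULL)"
        )

    def store(self, url, body, fetched_at):
        """Record one fetch; fetched_at is a timestamp such as time.time()."""
        self.db.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, fetched_at, body))
        self.db.commit()

    def as_of(self, url, when):
        """Return the body as it was last seen at or before `when`, or None."""
        row = self.db.execute(
            "SELECT body FROM pages WHERE url = ? AND fetched_at <= ? "
            "ORDER BY fetched_at DESC LIMIT 1",
            (url, when),
        ).fetchone()
        return row[0] if row else None

    def changes(self, url):
        """Return (fetched_at, body) for each fetch whose body differed from the previous one."""
        out, previous = [], None
        rows = self.db.execute(
            "SELECT fetched_at, body FROM pages WHERE url = ? ORDER BY fetched_at",
            (url,),
        )
        for fetched_at, body in rows:
            if body != previous:
                out.append((fetched_at, body))
            previous = body
        return out
```

`as_of` lets you re-run a parser against the page exactly as it looked when a bug report came in, and `changes` gives a quick list of the moments the page actually changed.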
Genuinely curious what the alternative is
I often feel like web scraping is treated as a commodity by people who don't understand any of the inherent technological complexities and challenges.
Very discouraging field to be in, especially when people claim to have pain but are unwilling to pay very much for it or show appreciation for the effort that goes into it.
edit: thanks for the downvotes. perfect illustration of how innovation is punished and unrewarded in this field.
Otherwise complaining about downvotes is no good either. Some will downvote because of that.
Welcome to pretty much every profession in the world.
> perfect illustration of how innovation is punished
I see no innovation in your post.
The complaining about downvotes was just the icing on the cake, cementing the downvote.
the other alternative is parsing schema.org schemas or other markup
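For the schema.org route, a minimal sketch that pulls JSON-LD blocks out of a page using only the standard library. It only covers the JSON-LD flavour (`<script type="application/ld+json">`); microdata and RDFa attributes would need a different extractor.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every application/ld+json script tag."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # list of text chunks while inside a JSON-LD script

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            raw = "".join(self._buffer)
            self._buffer = None
            try:
                self.items.append(json.loads(raw))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than abort the whole page


def extract_json_ld(html):
    """Return a list of decoded JSON-LD objects found in the HTML."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.items
```

When a site publishes this markup, `extract_json_ld(page_source)` tends to survive redesigns far better than CSS selectors, since the structured data is there for search engines and changes less often than the layout.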
P.S. I wrote a WWW::Mechanize::Query ext for it so that it supports css selectors etc if anyone is interested. It's on cpan.
edit: Especially if your scraping jobs take a LONG time - days and weeks, this stuff is extra handy. Might I add a great debugging environment (scrapy shell), error handling, rate limiting, respecting robots.txt, so much more.
As others have said, managing a project with Scrapy is super easy and highly configurable with sane out of the box settings.
I really enjoyed it this time. It has mostly figured out how to be unobtrusive and it now provides a lot of handy stuff out of the box, like the scrapy shell, the ability to easily retry certain error codes, a built-in caching mechanism so that multiple iterative runs are semi-reasonable, and an intelligent crawler that automatically ignores duplicate links from the same session. Compatibility with things like Scrapy Cloud is a nice bonus.
It's worth a try if you haven't looked at it in a while.
For my last project I used a Chrome plugin which controlled the browser's URL locations and clicks. Results were transmitted to a backend server. New jobs (clicks, URL changes) were retrieved from the server.
With tools like Chrome headless this should now be possible, right?
Sometimes this can be a PITA though, for example Tableau obfuscates the JSON they send back, so it's easier to use Selenium to wait ten seconds and then scrape the resulting HTML.
It's not open source, but it's free up to 10k pages per month. And it can handle modern JS web applications (your code runs in the context of the crawled page). You can, for example, scrape an API key first and then use internal AJAX calls.
There's also a community page where you can find and use crawlers made by other users.
Typical use is an aggregator that needs a common API for all partners who are not able to provide one. So they have an API running on Apifier in an hour. It might break once in a while - then you have to update your crawler (not that often if you use internal AJAX calls).
I feel like it's a hard sell to enterprises. Scraping is viewed inferior to an API so it makes sense for enterprises to just pay the target website for access to the data.
But you're right it's a hard sell to enterprises although we have some (e.g. real estate developer creating pricing maps)
I use Google Chrome on https://urlscan.io to get the most accurate representation of what a website "does" (HTTP requests, cookies, console messages, DOM tree, etc). For Chrome, this is probably the best library available: https://github.com/cyrus-and/chrome-remote-interface. Headless is working as well, but still has some issues.
OCR works well for certain scenarios where the UI is fixed, like desktop applications, but it's still fragile, very much like CSS and XPath selectors.
In fact, OCR is often far slower and less accurate than CSS/XPath selectors.
It has its niches but I think it's suboptimal for web automation/scraping.
Runs a little headless browser.
I spent a week selectively going through accommodations manually, then proceeded to complain to a friend of mine.
She's barely human, and she found a literally one-of-a-kind apartment dead center at a good price. The apartment was mine the next day.
Human scrapers man.
Fairly simple, and people seem to find it really useful.
- to end the main thread only if all tasks are done
- when every running task can produce multiple new tasks
- with limiting the maximum number of running threads
- always running the maximum number of threads if possible
semaphores to the rescue
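One way to get all four properties, sketched with a semaphore plus a pending-task counter. This is my own minimal take, not the poster's code: it starts one thread per task and lets the semaphore cap how many task bodies run at once, so waiting threads exist but sit idle.

```python
import threading


class BoundedTaskRunner:
    """Run tasks that may spawn more tasks, with a cap on concurrent work."""

    def __init__(self, max_threads):
        self._slots = threading.Semaphore(max_threads)  # caps running task bodies
        self._lock = threading.Lock()
        self._pending = 0  # tasks submitted but not yet finished
        self._all_done = threading.Event()
        self._all_done.set()  # nothing submitted yet, so nothing to wait for

    def submit(self, fn, *args):
        """Schedule fn(runner, *args); fn may call runner.submit for new tasks."""
        with self._lock:
            self._pending += 1
            self._all_done.clear()
        threading.Thread(target=self._run, args=(fn, args), daemon=True).start()

    def _run(self, fn, args):
        with self._slots:  # blocks while max_threads tasks are already running
            try:
                fn(self, *args)
            finally:
                with self._lock:
                    self._pending -= 1
                    if self._pending == 0:
                        self._all_done.set()

    def wait(self):
        """Block the caller (e.g. the main thread) until every task has finished."""
        self._all_done.wait()
```

A task counts its children as pending before it finishes itself, so the counter cannot reach zero while work is still on the way. As soon as a slot frees up the next waiting task takes it, which keeps the pool at the maximum whenever there is enough work.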
I don't understand what's wrong with downloading the information that is published on a public web server. That is what that server was made for in the first place.
Of course there are people who scrape the website, add advertisement and optimize it to rank better than the original website. But this problem can be solved with other measures.
And of course those who do the scraping must limit their request rate so the server doesn't get overloaded.
Furthermore, the fact that the server gives 200 responses is not sufficient implied permission IF a no-scraping policy has been communicated in some other way such as robots.txt or (clearly communicated) TOS.
The techno-nihilist argument of "i'm just throwing bits on the wire, they're choosing to respond" is pretty unconvincing in a world where intent and context matters.
I think that the only thing that has to be regulated is the load (number of requests per unit of time) on the server. So that it doesn't prevent the server from serving pages to other visitors.
Regarding robots.txt, I am not sure if ignoring it should have legal consequences. It is just a hint to robots, don't visit these pages or we will ban you.
If there is no harm done to the server I don't see any problem with scraping.
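Whatever its legal weight, honouring robots.txt is cheap to do. A small sketch using Python's standard `urllib.robotparser`; the user agent string and the sample rules in the usage note are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robots_checker(robots_txt: str, user_agent: str):
    """Return (allowed, crawl_delay) for an already-downloaded robots.txt body.

    `allowed` is a function url -> bool; `crawl_delay` is the Crawl-delay value
    that applies to this user agent, or None if the file does not set one.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def allowed(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return allowed, parser.crawl_delay(user_agent)
```

With a file containing `User-agent: *`, `Disallow: /private/` and `Crawl-delay: 5`, the returned `allowed` function rejects URLs under `/private/`, and the delay value can feed directly into whatever rate limiting you already apply to keep the load down.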
Or, say, returning "403 Forbidden" or using a CAPTCHA?
If you don't want people stealing your music, don't put it in other people's hands. The minute you release it to the world, it can't be reversed. See the Streisand effect.
Likewise, you cannot place burden on your visitors to read and analyze ToS with their lawyers and submit an official request via fax.
If you don't want people accessing your server, put it behind a paywall or a login screen at least so that you can easily ban people who don't play by your rules.
Otherwise, you have no excuse. If it's publicly accessible, then you cannot enforce any legally binding agreement as you've left the front door wide open and expect people to read tiny letters you put in front of your mailbox.
ToS is a one way street.
"How was I supposed to know they didn't want me to scrape it? I'm just an innocent passerby dropping bits on a wire" is bullshit in many, many cases. You do know, or at least could easily find out if you wanted to, but choose to maintain a fiction of ignorance to avoid responsibility.
Sure, if someone makes it difficult or arcane to read and understand their TOS, you're probably not (morally) bound by it. But if you close your eyes and plug your ears, you don't have much of a leg to stand on.
And I don't see how electronic ToS have any legal power. It is not a contract one has signed before visiting a site. I think ToS should only bind the website owner, not its visitors.
It's unenforceable and has never stood up in court.
You can take someone to court if you have the cash, but you are going to be on the losing side.
ToS is not a mutually binding agreement. Even if you have a checkbox that says "I have read the terms", it's often thrown out because nobody expects you to read EULA down to the letter.
The fact that it does (or, at least, has been interpreted to) criminalize TOS violations is one of the most important flaws in the Computer Fraud and Abuse Act.
All people that had a hand in this man's death should be charged with manslaughter.
Truly a shame and how draconian laws end up killing the innocent in the United States.
That argument often turns into "But I probably won't get caught" which is a similarly weak defence.
Edit: On second thought, I guess you are referring to overcoming 403s and Captchas?
We are getting small wins, but it's going to be slow going until we can get Congress to adjust both the CFAA and the Copyright Act, or until we can get SCOTUS to seriously alter the way these acts have been interpreted with reference to internet access.
Basically, all damage claims are null because QVC & Resultly never entered into a mutual agreement. You can write whatever the fuck you want in your ToS but it's not legally binding.
> Judge Beetlestone also rejected QVC’s claims that Resultly violated the Computer Fraud and Abuse Act by knowingly and intentionally harming the retailer when Resultly caused the shopping network's website to crash, reasoning that the tech company and QVC both could only earn money if the site was operational.
I see you are back on the FUD train surrounding web scraping but there's only one very specific case where your fears materialize: "When you receive C&D from said website, do not continue scraping". Such was the case for Craigslist vs 3Taps.
Please do not cite legal resources and grossly twist realities to spread FUD. If you don't want to be web scraped, simply do not put it online.
I'm pretty sure that's what I said re: the ToS? That's only one element of the case (breach of contract). You are correct that in this case, browsewrap was not considered applicable. There have been a few other cases where it wasn't too, as in Nguyen v. Barnes & Noble, Inc., but there have been cases where it was, as in Hubbert v. Dell Corp.. Also note that most cases re: browsewrap do not challenge the viability of automatically entering an agreement by clicking around the site, but rather argue that the notification was simply not prominent enough. It could be worked around by moving the notice into a more prominent location on the page.
The other element is CFAA, and referring to the wide-open robots.txt helped Resultly establish that they were attempting to act in good faith and were not maliciously damaging QVC's systems and not exceeding authorized access to the computer system.
>Please do not cite legal resources and grossly twist realities to spread FUD. If you don't want to be web scraped, simply do not put it online.
I'm not trying to spread FUD, I'm just trying to make it clear that the legal situation is precarious. Google has clearly shown that if you are able to build your coffers and reputation faster than you can incur lawsuits, you can win on this. In fact, lots of big companies begin that way, and become big companies merely because they were lucky enough to get big enough to stand up for themselves before the legal threats started coming in the door.
It's understood that you'll be scraped if you put it online. That doesn't mean scraping is legal.
You may be confused here -- I'm not a publisher trying to stop people from doing this. I'm an entrepreneur whose business depended on scraping data from a specific source. That business got destroyed when they chose to dispatch their law firm against us.
The point of repeatedly discussing this on HN is to make the legal situation clear so that people work to change it, and to make sure people who are going into similar ventures are informed about the legal risks associated with them.
As I said on another post, I am not a lawyer, and this is according to my layman's understanding. No one should misinterpret my posts as legal advice. I'm not going to copy and paste this disclaimer into every post I make because it should be implicit, and a few IANAL disclaimers is plenty.
My criticism was that you mixed in service providers and tool providers that enable businesses to make money off scraped data-the vendor cannot be held responsible for misbehaving clients, the best it can do is cut them out when requested by external parties. Toyota doesn't appear as witness to vehicular manslaughter cases. It's a car to take you from A to B but it's not Toyota's fault if the customer runs over something between those two points and not its intended design (QVC vs Resultly).
It also doesn't help that there are pathological web scrapers who simply do not have the money to do anything fruitful, so they will bootstrap using any means necessary and play the victim card when they are denied. This particular group is responsible for the majority of the litigation. These are people who otherwise have no business, piggybacking off somebody else and using brute force, bringing heat to everyone involved.
Developers presumably scrape websites because the data is of some value to them, frequently commercial value. Google's entire value proposition is based on scraped data, and it's one of the most valuable companies on the planet. The way the data is used is not necessarily relevant to whether the act of scraping a web page violates the law or not -- several more basic hurdles involving access, like the CFAA and potential breach of contract depending on whether the facts of the case are such that the court holds the ToS enforceable, have to be overcome before the matter of whether one is entitled to utilize the data obtained becomes the hinge.
>My criticism was that you mixed in service providers and tool providers that enable businesses to make money off scraped data-the vendor cannot be held responsible for misbehaving clients, the best it can do is cut them out when requested by external parties.
3Taps is one of the most prominent such cases and it was just the type of tool that you're claiming wouldn't be held accountable. 3Taps's actual client was PadMapper, but since 3Taps was the entity actually performing the scrape, they were the party that was liable for these activities.
The lesson we've learned from 3Taps is that scraping tools might be OK if they strictly observe any hint that a target site doesn't want the attention and cease immediately, but there's really no guarantee either way.
Most people won't sue if you adhere to a C&D, not because they couldn't do so and win, but because it's much cheaper to send a C&D and leave it at that, as long as that settles the issue moving forward. Litigation is very slow and expensive.
It was a poorly executed business strategy because they were up against a powerful legal team.
If you receive a C&D or a request to stop scraping, the best thing to do is to not continue and just let that customer go.
In fact, it's assumed that defendants weren't intentionally breaking the law, which is why when it's clear that they were, courts triple the actual damages for willful violations. 
If a reasonable person wouldn't realize that they were "exceeding authorized access", that probably limits a potential CFAA claim, but that's it, and that's not the only potentially perilous statute when you're a scraper. In the QVC case, Resultly got lucky that QVC did not have an up-to-date robots.txt; otherwise, they very well may have been on the hook for multiple days of lost online revenue, despite their immediate cessation upon receipt of a C&D.
Again, you are more than welcome to take your perspective and run with it, and it's plausible that no one will get mad enough at you to sue over it. That doesn't change the law.
I would assume that 3Taps pursued this litigation not because they had special love for PadMapper, but because they felt it was important for their business to be allowed to scrape major data sources and thought they'd be able to win. Pretty sure Skadden was their law firm so they gave it an earnest try, but ultimately lost.
Relevant cases are Craigslist v 3Taps, Facebook Inc. v Power Ventures Inc., and several others. This is at the point where it's basically well-established. The exception is Perfect 10 v. Amazon, where judges ruled that since it was Google and they don't want to break Google, it's OK. Copyright law allows such evaluations because each judge must decide whether a use was "fair" or not.
And copyright laws are supposed to protect only creative works, not every page on the web.
>And copyright laws are supposed to protect only creative works, not every page on the web.
Copyright law protects all works of sufficient originality. Pretty much the only thing it doesn't protect is a plain list of facts (and in the European Union, it even protects that, known as "database rights"). The minimum standard of originality for copyright protection applies to practically every page on the web, yes.
In effect, this means that you can copy a list of names and addresses from a phone book, but you can't copy the layout. Since you can't access a web page without making a copy in RAM, if the publisher has revoked your license to access the content, even accessing and extracting the raw factual information within the body of the page is an infringement (because your RAM copy is an infringing copy).
IANAL and this is based on my layman's understanding.
So if a website sends you a letter to stop, do not continue scraping them. Until that happens, it's fair game. ToS is not legally binding. Those two cases were the result of damage caused to the website.
STOP. SPREADING. FUD. cookiecaper. All of your comments are the same unsound legal advice yet Mozenda, Import.io, and a whole bunch of tools & service providers are humming along just fine.
Disclaimer: I'm not a lawyer, this is not legal advice, consult a real lawyer and not random HN comments.
(Technically they settled.)
Read up on that case and you'll see that even absent a C&D, the judge reasoned that Craigslist's IP ban against 3Taps was a separate incident of affirmatively communicating its intention that 3Taps refrain from accessing the site:
> The calculus is different where a user is altogether banned from accessing a website.
> The banned user has to follow only one, clear rule: do not access the website. The notice
> issue becomes limited to how clearly the website owner communicates the banning. Here,
> Craigslist affirmatively communicated its decision to revoke 3Taps’ access through its cease-and-desist
> letter and IP blocking efforts. 3Taps never suggests that those measures did not
> put 3Taps on notice that Craigslist had banned 3Taps; indeed, 3Taps had to circumvent
> Craigslist’s IP blocking measures to continue scraping, so it indisputably knew that Craigslist
> did not want it accessing the website at all.
>So if a website sends you a letter to stop, do not continue scraping them. Until that happens, it's fair game. ToS is not legally binding. Those two cases were the result of damage caused to the website.
No, that's incorrect. You can believe this all you want, and I truly hope that it all goes well for you. It's totally possible that you will never piss off someone who has the resources to file a lawsuit over it. But you should know that you can be bound by a browsewrap agreement (and that it's quite easy to be bound by a clickwrap agreement, which, in practice, may not be much different and which scrapers may automatically follow (e.g., a link that says "click here to enter and agree")).
It's really going to come down to the judge's belief that the notification is adequately prominent and that a "reasonably prudent user" would be aware of the stipulations.
Quoth from Nguyen v. Barnes and Noble (citations removed):
> where, as here, there is no evidence that the website
> user had actual knowledge of the agreement, the validity of
> the browsewrap agreement turns on whether the website puts
> a reasonably prudent user on inquiry notice of the terms of
> the contract. [...] Whether a user has inquiry notice of a
> browsewrap agreement, in turn, depends
> on the design and content of the website and the agreement’s
> webpage. Where the link to a website’s terms of use is buried at the bottom of the page
> or tucked away in obscure corners of the website where users
> are unlikely to see it, courts have refused to enforce the
> browsewrap agreement. [...] On the other hand, where the website contains an
> explicit textual notice that continued use will act as a
> manifestation of the user’s intent to be bound, courts have
> been more amenable to enforcing browsewrap agreements.
> [...] In short, the conspicuousness and
> design all contribute to whether a reasonably prudent user
> would have inquiry notice of a browsewrap agreement.
>STOP. SPREADING. FUD. cookiecaper. All of your comments are the same unsound legal advice yet Mozenda, Import.io, and a whole bunch of tools & service providers are humming along just fine.
I'm not giving any legal advice as I'm not a lawyer. For the third or fourth time here, this is all according to my layman's understanding. It's based on things I learned that time I had to close my business or face a lawsuit from a massive company over just such issues.
It's crucial for companies that provide scraping services to be aware of these issues and I know of at least one such company who is aware of them and who takes several precautions to provide some distance from potential legal liability, though they are still not 100% out of the woods. As is usually required in entrepreneurship, they're taking a calculated risk. Should someone wage a legal challenge against their activity, they have millions of dollars in the bank from investors who presumably have researched this and are willing to accept the cost of the potential legal liability.
I'm not saying that people shouldn't make businesses that depend on scraping data. I just think they should know what they're getting into before they do so.
You are correct that some businesses have been able to engage in such activities without being sued out of existence up to this point. Unfortunately, that doesn't mean that others will be as lucky.
I fully agree that anyone who is seriously interested/concerned about this should ask a lawyer. I certainly did. Their answers were not good news for me. Maybe they will be for you.
I would guess that most judges would not be charitable to someone pretending that they've circumvented this by rotating through proxies pre-emptively. In fact, this would likely work against the defendant as it'd be evidence of willful infringement, which is typically 3x damages. If you can convince the judge that you were just incidentally rotating IPs to protect privacy or something, you might get away with it, but it's definitely not a simple answer, and that still only gets you up to the point of receiving a C&D.
I'm not sure why you're trying so aggressively to mislead people about the legal precariousness of data scraping, but at this point it should be clear that this is not a simple matter and it's not something to approach lightly or dismissively.
For scraping-related activities, Scrapinghub would probably be the party sued, as was the case in 3Taps, though the clients could probably also be legitimately sued for various things, most obviously copyright infringement.
Again, I'm really not sure what you're getting at here. Yes, it's a great idea to check with a lawyer and assess your potential legal exposure. That's why lawyers exist! You can then ask them questions, as Scrapinghub surely has, about how to minimize that potential legal exposure. You definitely SHOULD do that, especially since scraping is more or less illegal in the United States.
Courts frequently use an analogy to private physical property to address the matter of accessing a web site. Running a business based on scraping until someone sends you a C&D is roughly the same as running a business based on trespassing on private property until someone serves you with a no-trespass order.
Maybe it will work out fine, and most of the time, as long as you leave the property promptly upon request, you probably won't have an issue just because there's no benefit in dragging the matter out further. But that doesn't mean there isn't legal risk involved in running such a business, nor does it mean that you won't be liable for damages incurred whilst trespassing.
In such a case, questions about whether the borders of the property were clearly delineated, whether "No Trespassing" signs were posted, whether a reasonable person would've understood they weren't allowed to be there or not, etc., would be asked to determine the existence and/or extent of the trespasser's liability.
In the same manner, there is substantial risk involved in running a business whose primary function is to scrape websites, and the same types of questions would be (are) asked in a court case related to network access. People deserve to be informed of that.
That's not FUD, it's just the law. If you don't like it, well, most people who know what they're talking about don't either, but that doesn't change the law. Saying "$Party_X hasn't been sued over it!" also doesn't change the law or make the process any less legally risky.
If you find this arrangement unsettling or absurd, as you obviously do, I would suggest that you direct your energies/attention to your local representatives, the EFF, and other types of political activism that may help rectify the situation rather than accusing HN commenters of spreading FUD.
When you do something illegal, you probably won't get sued for it, because it costs a ton of money to sue someone and it's not likely that you're annoying anyone enough to justify that. This is especially the case if you back off at the first sign of annoyance. That's as much as we can say for your angle.
If you're comfortable basing a business on that, be my guest.
Businesses that utilize web scraping to achieve business goals at a direct expense of another business will get you in trouble not because of web scraping but simply trying to create competition. Businesses with large cash reserves use litigation to snuff out competition because their businesses are largely indefensible without such forceful litigation; e.g., craigslist would not exist if they let anyone scrape them.
Businesses that build and sell web scraping services and tools are less likely to be impacted for the same reasons if they comply with formal requests to stop scraping. 3Taps received notices beyond just IP ban (this alone does not set enough of a context) but they chose to ignore it and continue on. 3Taps had enough of a financial motivation on the line to put out their neck for their customer, PadMapper. Pretty fucking stupid if you ask me, no one customer is worth risking the entirety of your business operation.
It's far more likely that the law exists to serve those who exploit it to protect their business interests. Generalizing and extrapolating based on a few court cases with their own dynamic set of variables and exceptions as fact is dangerous advice.
I just want to warn people reading your comments not to take it word for word, as the reality is far far less legally hostile: you are too small for people to go after and not an existential threat to the target website.
The argument that web scraping puts strain on web servers is a pretty laughable defense. Craigslist alone gets millions of hits every day but can't serve pages requested by a python script? 3taps fucked themselves because they took money AND they put their neck out for their customer.
That's the lesson here, don't risk your entire business for one customer. It's not fair to the rest of your customer base.
It should be, because I've stated it probably 6 times in this thread.
>someone just overly reacting to perceived legal liabilities by simply generalizing court cases and attempting to reach a conclusion that tries to fit everyone.
So for the seventh time, I'm not a lawyer, but isn't this how it works when questions about legality are posed? It's always based on the relevant statutes and the case law interpreting and applying those statutes. I mean, correct me if I'm wrong.
I'm glad you're not worried about someone looking at the case law and making a generalization about how it applies to the field.
If you want specific (i.e., non-generalized) legal information, you always need to discuss your individual affairs with a licensed attorney who is knowledgeable in the field and jurisdictions in which you'll be operating.
In practical terms, web scraping is usually illegal in the United States. In this case, that doesn't mean there's a law that says "web scraping is illegal", it means that there is a small group of laws, which, taken together, make it virtually impossible to scrape web pages with confidence that you're not getting exposed to potentially serious legal liability. Note that "illegal" is not the same as "criminal", but that the CFAA does provide for criminal penalties (and Aaron Swartz was being prosecuted under them for scraping research papers out of an academic database).
>Businesses that utilize web scraping to achieve business goals at a direct expense of another business will get you in trouble not because of web scraping but simply trying to create competition.
You're talking about the likelihood that a business will get sued by someone. That's great, but it doesn't change the legal status of the activity that someone is unlikely to sue you for.
My business did not directly compete with anyone. Everyone thought it primarily helped the data sources we used. People always told me that they were shocked that the company that was making the threat was upset about it. Even my lawyer said it seemed unusual and couldn't figure out what their underlying motive was.
The stakes are an important consideration, but yes, it is important to consider the impact if you do get sued/threatened by an unlikely plaintiff.
>3Taps received notices beyond just IP ban (this alone does not set enough of a context)
The 3Taps ruling casts doubt on the suggestion that an IP ban is itself insufficient notice. That issue hasn't been decided directly afaik, but the reasonable conclusion, if you are getting a 403 or a page that explicitly informs you your IP has been banned when you access a site, is that they are trying to keep you out and that further access likely violates the CFAA.
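To make that concrete, here is a rough sketch of a scraper treating a 403 or an explicit ban page as a signal to stop rather than something to route around. `AccessGate`, the phrase list, and the host name are hypothetical names I made up for illustration; this is just one conservative way to act on "they are trying to keep you out", not a statement of what the law requires.

```python
from dataclasses import dataclass, field

# Assumption: 403 is the only status treated as a ban signal here.
BAN_STATUS_CODES = {403}
# Hypothetical phrases a ban page might contain; real sites word this differently.
BAN_PHRASES = ("your ip has been banned", "your ip address has been blocked")


@dataclass
class AccessGate:
    """Remembers hosts that have signalled we are not welcome."""
    blocked_hosts: set = field(default_factory=set)

    def record_response(self, host: str, status: int, body: str = "") -> None:
        """Mark the host as blocked if the response looks like a ban."""
        text = body.lower()
        if status in BAN_STATUS_CODES or any(p in text for p in BAN_PHRASES):
            self.blocked_hosts.add(host)

    def may_request(self, host: str) -> bool:
        """Return False once a host has told us to stay out."""
        return host not in self.blocked_hosts


if __name__ == "__main__":
    gate = AccessGate()
    gate.record_response("example.com", 200, "<html>listing</html>")
    print(gate.may_request("example.com"))
    gate.record_response("example.com", 403)
    print(gate.may_request("example.com"))
```

The gate deliberately has no "retry from another IP" path: once a host is marked blocked, it stays blocked until a human decides otherwise.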
>3Taps had enough of a financial motivation on the line to put out their neck for their customer, PadMapper. Pretty fucking stupid if you ask me, no one customer is worth risking the entirety of your business operation.
That's definitely the risky side of the equation. The alternative side was that they'd win and be allowed to retain access to one of the largest data sources on the internet, and preferably set a precedent that allowed them to continue to scrape big data sources without concern moving forward. That gamble clearly did not pay off for them, but that doesn't mean it wasn't a reasonable gamble to take.
>It's far more likely that the law exists to serve those who exploit it to protect their business interests.
I agree, but I don't see how it's relevant. Lots of people believe that it's beneficial to their business interests to use the legal system to bully people who can't afford to stand up for themselves. Uh, congrats to them I guess? Why are you saying this like it's a normal thing? We should take steps to minimize the surface area that can be used for that.
If you're suggesting there is a small handful of bad guys to whom these laws need to apply, that's fine and I actually agree with you, but that means we need to fine-tune the law so that it only covers the bad guys, not virtually everyone if someone you're scraping is having a bad day.
You keep fighting this fight pretending like I'm saying something that's incorrect, and then you just come back and say that it doesn't matter because a) some people who scrape have not been sued; and b) people who start scraping business may not get sued if they adhere to the requests of those who politely ask them to stop. That's great, but it's neither here nor there. This is about what the law is, not whether you're going to be sued personally.
>Generalizing and extrapolating based on a few court cases with their own dynamic set of variables and exceptions as fact is dangerous advice.
It's all anyone can do when you're dealing with an emerging area of law, afaik.
>I just want to warn people reading your comments not to take it word for word, as the reality is far far less legally hostile: you are too small for people to go after and not an existential threat to the target website.
Yes, this is another thing I've stated multiple times. You probably won't get anyone mad enough at you to sue you. But you should know where you stand if you do. And you should try to fix the law in the meantime.
>The argument that web scraping puts strain on web servers is a pretty laughable defense.
Plaintiffs use this argument all the time and get injunctions filed on that basis regularly. Even if the defendant is not disruptive, judges say they need to issue the injunction or it will invite a pile-on effect that will be disruptive. Thus, they grant an injunction under a trespass to chattels doctrine, generally putting legal force behind a C&D.
>3taps fucked themselves because they took money AND they put their neck out for their customer.
3taps fucked themselves only because they tried to stand up and win the case. Perhaps it would've been better for them to try to lobby Congress instead and get the law transformed into something semi-reasonable, though it's likely they recognized the futility in that.
>That's the lesson here, don't risk your entire business for one customer. It's not fair to the rest of your customer base.
It seems like the lesson is that web scraping is legally precarious, and that if you're not careful about it, you can end up in a lot of hot water.
You keep acting like that's an absurd conclusion, but not really showing anything to discount the onerous outcomes that entrepreneurs in this space have faced. 3Taps is not the only case where this has been addressed.
In Facebook v. Power Ventures, the corporate veil was pierced and the entrepreneur was left with $3 million in personal liability, all for trying to create software that made it easy for users to export their own data, and only their own data, out of Facebook. Facebook acknowledged that it did not have any copyright interest allowing it to forbid Power from accessing that data specifically, but they continued to pursue copyright claims based on the RAM copy of the Facebook site from which the content was extracted.
The point is that the current law makes scraping a perilous exercise. Perhaps you won't have problems, but that's probably only the case if a) you stay so small no one will ever target you or b) you know the law and you take extra precautions to protect your business so that any accusations of wrongdoing are clearly invalid against current law. Scrapinghub is trying to do this, but IMO it's insufficient if they get an aggressive/hostile litigant.
The truth is that Scrapinghub et al are on the precipice and they're going to stay there until precedent changes (likely through a SCOTUS override, particularly one overturning the RAM copy doctrine, which is probably plausible, and one putting constraints on the ability to revoke access to public web sites under the CFAA, which is probably not) or until the law changes. They only need to get hit with one well-placed lawsuit and they'll be goners.
You can argue til the cows come home about how they won't get sued because they stop once they get a C&D, but that's not necessarily true, and that doesn't fix the laws around scraping.
> The copies of webpages stored automatically in a computer's cache or
> random access memory ("RAM") upon a viewing of the webpage fall within
> the Copyright Act's definition of "copy." See, e.g., MAI Systems Corp.
> v. Peak Computer, Inc., 991 F.2d 511, 519 (9th Cir. 1993) ("We recognize
> that these authorities are somewhat troubling since they do not specify
> that a copy is created regardless of whether the software is loaded into
> the RAM, the hard disk or the read only memory (`ROM'). However, since
> we find that the copy created in the RAM can be `perceived, reproduced,
> or otherwise communicated,' we hold that the loading of software into
> the RAM creates a copy under the Copyright Act.") See also Twentieth
> Century Fox Film Corp. v. Cablevision Systems Corp., 478 F.Supp. 2d 607,
> 621 (S.D.N.Y. 2007) (agreeing with the "numerous courts [that] have held
> that the transmission of information through a computer's random access
> memory or RAM . . . creates a `copy' for purposes of the Copyright Act,"
> and citing cases.) Thus, copies of ticketmaster.com webpages
> automatically stored on a viewer's computer are "copies" within the
> meaning of the Copyright Act.
I'm increasingly feeling that the law is giving way too much control over content published on the Internet to the publishers.
Similarly, concepts like the "first sale doctrine" are becoming less applicable with digital delivery, as it's impossible to identify a "hard copy" of something that may be eligible for resale. That completely obliterates the secondary market for many products that are accessed through computers, including software, games, movies, and books.
The CFAA essentially allows network operators to arbitrarily make someone a felon overnight. Reddit co-founder Aaron Swartz is the most prominent example of this; his criminal prosecution under the CFAA (for scraping publicly-funded research papers out of a database) was pending when he committed suicide.
We badly need digital rights reforms, but since major companies have been allowed to profit handsomely off these shifts and since they find it rather convenient to bully small innovators with serious legal threats, which are easy to craft in this climate, it doesn't seem that anyone is making this a priority.
To clarify: I lived in some cheap third world cities, and I can certainly see that solving captchas could finance a rather nice life.
I think part of it is how you crawl (PhantomJS, for example, seems to hit a captcha almost every time), but things like IP and proxy usage could make this trigger more often.
The economics are in their favor, and I make it a point not to fight economics when I recognize them, it's rarely sustainable.
The captcha was unusually simple to solve; in most cases the best strategy is to avoid seeing it in the first place.
I'm wondering how they were misunderstood, given they even used quotations to show they were quoting the article which is the one that made the claim that it was "unfortunately spelled".