OkCupid issued a DMCA takedown against researchers who released scraped data: https://www.engadget.com/2016/05/17/publicly-released-okcupi...
Since both of these incidents, I now only scrape either a) through the API, following its rate limits, or b) if there is no API and the data is explicitly meant to be shared publicly (e.g., blogs), in which case I follow robots.txt. Of course, most companies have a do-not-scrape clause in their ToS anyway, to my personal frustration.
(Disclosure: I have developed a Facebook Page Post Scraper [https://github.com/minimaxir/facebook-page-post-scraper] which explicitly follows the permissions set by the Facebook API.)
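For anyone wanting to adopt a similar policy, here's a minimal sketch of what "check robots.txt and pace your requests" can look like (Python with the requests library; the site, user-agent string, and delay are made-up placeholders, not recommendations for any particular target):

    # Check robots.txt before fetching and pace requests conservatively.
    import time
    import urllib.robotparser
    import requests

    SITE = "https://example.com"         # hypothetical target
    USER_AGENT = "my-research-bot/0.1"   # identify yourself honestly
    DELAY_SECONDS = 5                    # fixed, conservative pacing

    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(SITE + "/robots.txt")
    robots.read()

    def polite_get(path):
        url = SITE + path
        if not robots.can_fetch(USER_AGENT, url):
            return None                  # disallowed by robots.txt, skip it
        response = requests.get(url, headers={"User-Agent": USER_AGENT})
        time.sleep(DELAY_SECONDS)        # stay well under any published rate limit
        return response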
If you don't think this is reasonable, chances are you've never run a large website, or analyzed the logs of a large website. You'd be astonished how much robotic activity you'll receive. If left unchecked it can easily swamp legitimate traffic.
Unless you have a way for me to automatically identify "honourable" scrapers such as yourself as distinct from the thousands upon thousands of extremely dodgy scrapers from across the world, my policy shall remain.
And while in the US that may "just" be treated as unauthorized access, in the EU, if you make the data public it's also a violation of the Data Protection Directive, putting you at risk of prosecution in every EU country from which you have included data.
You may be right from a risk minimisation perspective. But for a lot of data the risk in the case of exposure is low enough that it is a totally valid risk management strategy to assume that legal protections will be a sufficient deterrent to prevent enough of the most blatant abuses.
As a private individual it's not hard to comply either, for private use. If you publish it, it becomes a different story, because it's PII. And, as soon as it's in possession of a company, they need to comply with more rules about securely storing it, etc. (this isn't enforced very well, though). Private individuals can't be held to that because there's (in theory) no legal way to check it.
This is a double standard plain and simple, and a very dangerous one at that.
My point was that when you call something a double standard, you're arguing two things of equal value have been judged differently under the same standard. But by acknowledging they've been judged differently, you're acknowledging that there is a judgement, a standard, that applies the same to both, and produces the results you object to. What you really object to is the fairness of the qualities checked by the standard.
Since the outcome of calling things that versus calling them a double standard is the same, I think most people already know this and have no trouble with it. My protests were worthless.
It could gain value if there were certain whitelisted judgeable aspects (like expected value), and judgements that aren't based on things from the whitelist were considered outside the scope of a standard. Then, calling the standard unfair and calling it a double standard would have different meanings (if only in some contrived way, since any aspect is just an argument away from the whitelist).
Even normal trespassing laws are way too overreaching (see how it is handled in the UK for a saner example) but now you have the amazing possibility of remote trespassing.
The fun part is that it's just a matter of someone hiding something that says you cannot access the site in a place that you have to access the site in order to read -- the ToS. Suing people over this is idiotic.
The real problem is the involvement of government, and this kind of absurdity regarding ToS, EULAs and so on is something that has been going on for decades. If you have the money, you can make the government your personal watchdog.
If a stranger enters my house without my permission, that's trespassing. But there's nothing unfair about letting in someone who I invite over.
For websites, it's not fair to have different rules for Google than for others. What would be fair is some kind of rule about how often visitors can visit, how much they're allowed to download, etc.
Personally, though, I think all this is total BS. Sites are open to the public, but they also serve the whims of their owners. If the site wants to prevent access to people from a certain IP range, that should be their right. If they don't want any scrapers, that should be their right too, or if they want to allow Google and not anyone else, that should also be their right. What isn't right is that they can use the government to enforce these arbitrary rules. If they want to block my scraper, that's fine, if they can do it on their end technologically. If they want to block my IP, they can do that too. But suing me or having the cops come to my door because they're too incompetent or lazy to do these things technologically is unacceptable. The role of government is not to enforce arbitrary policies made up by business owners.
1. Google can come in
2. Other Americans can't come in
3. People from China (or anywhere else where US laws don't apply) can come in
It might not be unfair, but it is certainly pointless and arbitrary.
And all smart websites should include a ToS that says you are not allowed to access their data, so they can sue for trespassing anyone that they don't like selectively.
The far reach of government into this, and also the piracy stuff (which I do not condone, but arresting people for it is way too much), is what makes me want the system to collapse under its own weight. Like some website suing members of Congress for visiting it while violating its ToS, as in this case.
I also secretly wanted Oracle to win vs. Google, so that cloning an API would be piracy; that would extend to making it a crime to purchase pirated goods, which would make all clean-room reverse engineering a criminal activity. That would lead, in theory, to anyone who uses a PC without an authentic IBM BIOS (look up Phoenix BIOS) being arrested, so even the US president would fall into that. It would have been a glorious shitstorm if Oracle had won and IBM had taken that precedent to its logical implications: the computer world would have failed, and the law would either be made even more arbitrary or be fixed, but at least it would have shown how idiotic the state of affairs was.
Your next argument may very well be a very racist one, with the very same excuse you used above.
They could scrape your website and then prevent you from scraping your own data back.
The whole process is silly; it reflects the duct tape and chicken wire nature of the www.
No one should have to "scrape" or "crawl".
Data should be put into a open universal format (no tags) and submitted when necessary (rsynced) to a public access archive, mirrored around the world.
This would bridge the gap until we reach a more content-addressable (cf. location-based) system.
Clients (text readers, media players, whatever) can download and transform the universally formatted data into markup, binary, etc. -- whatever they wish, but all the design creativity and complexity of "web pages" or "web apps" can be handled at the network edge, client-side.
"Crawling" should not be necessary.
No one should have to store HTML tags and other window dressing for data.
To give an example, there is a lot of free open source software mirrored all over the internet, mostly on ftp servers, but also on http, rsync, etc.
If you use Linux or BSD you probably are using some of this software. If you use the www, then you are probably accessing computers that use this software. If you drive a new Mercedes you are probably using some of this software. There are a lot of copies of this code in a lot of places.
Is that centralized? Does anyone hosting a mirror ("repository") "own" the software? Is it the same person or entity hosting every mirror?
Compare Google's copies of everyone else's data, also replicated in a lot of places around the world. Who "owns" this data?
In my experience, on a large site, Google will often slurp as much as you let it, upwards of hundreds of pages per second.
There's definitely a correlation between my sites' Google rankings, their organic traffic, and their crawl rate. The other sites I run are Alexa top 30,000 and top 100,000. They all feature dynamically changing content, but Google is definitely using a higher crawl rate on my higher ranking sites. This isn't a surprise though, Google has limited resources like everyone, and they'll focus those resources in a way that provides the most benefit.
Edit: If you're talking about the correlation between daily ranking and daily crawl rate for an individual site, then no, I'm not aware of any patterns. For example, the graph is flat for organic traffic and total indexed pages, but the crawl rate jumps up and down as mentioned, and it doesn't appear to relate on a daily basis.
Scraping against the TOS is super bad netizen stuff, and I don't think people should be posting positive reviews of people doing this. Breaking captchas and the like is basically blackhat work and should be looked down upon, not congratulated as I see in this thread.
Scraping, in my opinion, isn't black hat unless you are actually affecting their service or stealing info.
If you are slamming the site with requests because of your scraping, yeah you need to knock it off. If you throttle your scraper in proportion to the size of their site, you aren't really harming them.
In regards to "stealing info", as long as you aren't taking info and selling it as your own (which it seems OP is indeed doing), that is just fine.
tl;dr: Scraping isn't bad / blackhat as long as you aren't affecting their service or business.
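To make the "throttle in proportion to the size of their site" point concrete, here's a rough sketch in Python; the traffic thresholds and intervals are invented purely for illustration:

    import random
    import time

    def request_interval(estimated_daily_pageviews):
        """Seconds to wait between requests; smaller sites get longer waits."""
        if estimated_daily_pageviews < 10_000:
            return 30.0   # tiny site: stay essentially invisible
        if estimated_daily_pageviews < 1_000_000:
            return 5.0
        return 1.0        # large site: still far below a burst of browser tabs

    def throttled_fetch(urls, fetch, estimated_daily_pageviews):
        interval = request_interval(estimated_daily_pageviews)
        for url in urls:
            yield fetch(url)
            time.sleep(interval + random.uniform(0, 1))  # jitter, avoid lockstep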
And do you understand their site infrastructure to know whether you're doing harm? It's perfectly possible that your script somehow bypasses safeguards they had in place to deal with heavy usage, and now their database is locking unnecessarily.
Because we were a specialized browser used by people looking for a very specific piece of data, we could employ caching mechanisms that meant that each person could get their request fulfilled without having to hit the data source's servers. We also had a regular pacing algorithm that meant our users were contacting the site way less than they would've been if they were using a conventional desktop browser.
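Conceptually that kind of shared cache is simple; a sketch of the idea (the TTL and fetch function are placeholders, not what we actually ran):

    import time

    CACHE_TTL_SECONDS = 300              # placeholder freshness window
    _cache = {}                          # url -> (fetched_at, body)

    def cached_fetch(url, fetch):
        """Serve many users from one upstream request per TTL window."""
        now = time.time()
        hit = _cache.get(url)
        if hit and now - hit[0] < CACHE_TTL_SECONDS:
            return hit[1]                # no upstream hit at all
        body = fetch(url)                # single request on behalf of all users
        _cache[url] = (now, body)
        return body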
Our service saved the data source a large amount of resource cost. When we were shut down, their site struggled for about two weeks to return to stability. I think they had anticipated the opposite effect.
Our service also saved our users a large amount of time. We were accessing publicly-available factual data that was not copyrightable (but only available from this one source's site). There's no reason that the user should be able to choose between Firefox and Chrome but not a task-specialized browser.
It is true that some people will (usually accidentally) cause a DDoS with scrapers because the target site is not properly configured, but the same thing could be done with desktop browsers. It doesn't mean that scrapers should be disadvantaged.
And if airline tickets are based on supply v demand, it might even be possible to drive down ticket prices by suddenly dropping a load of blocks near to the flight date.
This can easily be prevented by requiring ID matching the ticket on entry, but the ticket sellers often don't seem to care.
Not even remotely absurd. Where is the data your scraper is consuming coming from? It's almost always served from some sort of data repository (SQL or otherwise). That data costs far more per MB to serve up quickly than JS/CSS/images.
Suppose, for example, you host a blogging platform that has one very popular user. Most accounts on your site don't get a ton of visitors, and that one very popular user's posts are all stored in cache.
Then along comes a scraper. He thinks, "Hey, this site is serving up a million page impressions a day. It can definitely handle me scraping the site".
But when he runs the scraper, he fills up the cache with a ton of data that it doesn't need, causing cache evictions and general performance degradation for everyone else.
What if you get a normal user who says "Hey, I wanna see some of the lesser known authors on this platform" and opens up a hundred tabs with rarely-read blogs? What if you get 10 users who decide to do that on the same day? Is it reasonable to sue them? Should there be a legal protection to punish them for making your site slow?
Don't blame the user for your scaling issues. If the optimized browser ("scraper") isn't hammering your site at a massively unnatural interval, it's clean. And if it is, you should have server-side controls that prevent one client from asking for too much data.
These are just normal problems that are part of being on the web. It's not fair to pin it on non-malicious users, even if they're not using a conventional desktop browser.
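For what it's worth, the "server-side controls" mentioned above don't have to be elaborate; a per-client sliding-window limit is often enough. A sketch in Python (the numbers are arbitrary, and real deployments would usually do this at the proxy or framework layer):

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60
    MAX_REQUESTS_PER_WINDOW = 120        # arbitrary illustrative ceiling

    recent = defaultdict(deque)          # client_ip -> timestamps of recent hits

    def allow_request(client_ip):
        now = time.monotonic()
        window = recent[client_ip]
        while window and now - window[0] > WINDOW_SECONDS:
            window.popleft()             # forget hits outside the window
        if len(window) >= MAX_REQUESTS_PER_WINDOW:
            return False                 # respond with 429 Too Many Requests
        window.append(now)
        return True

Returning an early 429 is also kinder to well-behaved scrapers than silently degrading for everyone.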
Second, search engines don't always respect robots.txt; sometimes they do, sometimes they don't. Even Google itself says it may still contact a page that has disallowed it.
Third, robots.txt is just a convention. There's no reason to assume it has any binding authority. Users should be able to access public HTTP resources with any non-disruptive HTTP client, regardless of the end server's opinion.
 "You should not use robots.txt as a means to hide your web pages from Google Search results. This is because other pages might point to your page, and your page could get indexed that way, avoiding the robots.txt file." / http://archive.is/A5zh8
There's a somewhat related issue where, to ensure your site never appears in Google, you actually need to allow it to be crawled, because the standard for that is a "<meta name=robots content=noindex ...>" tag, and in order to see the meta noindex, the search engine has to fetch the page.
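If it helps, here's a small sketch for checking that a page actually serves that directive (Python stdlib parser plus the requests library; pass whatever URL you want kept out of the index):

    import requests
    from html.parser import HTMLParser

    class NoindexFinder(HTMLParser):
        def __init__(self):
            super().__init__()
            self.found = False

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if (tag == "meta"
                    and (a.get("name") or "").lower() == "robots"
                    and "noindex" in (a.get("content") or "").lower()):
                self.found = True

    def serves_noindex(url):
        parser = NoindexFinder()
        parser.feed(requests.get(url).text)
        return parser.found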
Feel free to send any request to any server you want; it is certainly up to them to decide whether or not to serve it, but that doesn't absolve you of guilt for scraping someone's site when they explicitly ask you not to.
You are posting in a comment thread underneath my reply about rudeness and impoliteness, ironically being somewhat rude telling me off about what not to conflate when it was never what I said.
Another issue is finding the site's preferred home page. We look at "example.com" and "www.example.com", both with HTTP and HTTPS, trying to find the entry point.
This just looks for redirects; it doesn't even read the content. Some sites have redirects from one of those four options to another one. In some cases, the less favored entry point has a "disallow all" robots.txt file. In some cases, the robots.txt file itself is redirected. This is like having doors with various combinations of "Keep Out" and "Please use other door" signs. In that phase, we ignore "robots.txt" but don't read any content beyond the HTTP header.
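The probing is roughly the following shape (a simplified sketch with the Python requests library, not our actual code). HEAD is used so nothing beyond the headers is read:

    import requests

    def find_entry_point(domain):
        candidates = [
            "https://" + domain, "https://www." + domain,
            "http://" + domain,  "http://www." + domain,
        ]
        for url in candidates:
            try:
                # HEAD keeps this to headers only; follow the redirect chain.
                response = requests.head(url, allow_redirects=True, timeout=10)
            except requests.RequestException:
                continue
            if response.status_code < 400:
                return response.url      # final URL after any redirects
        return None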
Some sites treat the four reads to find the home page as a denial of service attack and refuse connections for about a minute.
Then there's Wix. Wix sometimes serves a completely different page if it thinks you're a bot.
Bandwidth is certainly part of it, but there's also database and app-server load (which may be the actual bottleneck) that a scraper isn't necessarily bypassing.
So I agree that a scraper isn't necessarily bypassing some load-heavy operations, but I find it highly implausible that a non-malicious scraper would be invoking operations that cause extra load (beyond just hitting the site too often). Frankly, I'd be surprised if there was a functional scraper that regularly invoked more resource cost per-session than a typical desktop browsing session to get equivalent data.
That wasn't my point. My point was: a lot of a website's costs are hidden from a web scraper (e.g. database load), so a scraper can't claim, based on the variables they can observe (bandwidth), that they're costing the website less than normal traffic.
I was basically responding to statements like this:
There's really no way for a scraper to know that unless the website tells them. Their usage pattern is different than typical users and raw bandwidth (for stuff like static images) may not matter to the website.
Your argument is basically boiling down to "scrapers could hit one load-heavy endpoint too fast", but so could desktop browsers. So I don't see what it has to do with scraping.
It does, because scrapers don't have normal usage patterns. They're robots and behave like robots.
> What's the difference between a user clicking the same button on the page 50 times or holding down F5 and a scraper that pings a page once a minute?
Typical users aren't usually in the habit of mashing F5, especially not for robotically long periods of time. It's basically the difference between a theoretical activity and an actual activity.
Basically, scraping is not regular usage, and I don't think it's correct to pretend that they're equivalent (or more extremely, that scraping is less costly to the website).
While it is true that someone could write a scraper that obviously behaved robotically, it is also true that someone could use their desktop browser in a robotic way. Mashing F5 is so common that there are many ancient memes referring to and making jokes about that activity. There are extensions that end users use to record browser macros, behaviors they want their browser to repeat over and over again.
However, this conversation about whether scrapers behave robotically or not is moot because a web site shouldn't break down under load when someone uses it in a slightly-irregular way. The obvious, crappy scrapers are trivial to block. The ones that blend into the traffic are no harm, no foul. If you can't tell the difference between an optimized browser like a scraper and a general-purpose browser like Chrome, why shouldn't it be allowed to talk to your site?
Just like every university site ever is completely down during signup days because everyone is mashing F5.
Link me your site, I’ll treat it like a college student waiting to be able to sign up for their classes.
To me, for private, personal use, a scraper should emulate a normal human browser as much as possible to avoid causing site problems and to avoid detection. If what you're doing can be done in the background, or by a cron process at some odd hour, it doesn't have to be fast at all, and you can set the timings to be similar to a normal human.
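Something like this sketch, assuming Python and the requests library; the pause range is just a guess at "similar to a normal human", not a magic number:

    import random
    import time
    import requests

    def fetch_like_a_person(urls, min_pause=8.0, max_pause=45.0):
        session = requests.Session()     # keep cookies, like a normal browser
        pages = []
        for url in urls:
            pages.append(session.get(url).text)
            time.sleep(random.uniform(min_pause, max_pause))  # irregular pacing
        return pages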
I'd consider that a bug, not a feature, but I still think it's incumbent on me, the guy scraping the website, not to trigger it.
If someone's production site, that's been around for a while, had a bug like this that can be caused by what you describe, I'd love to see how many real users they have. I'm sure it's possible under certain circumstances, but it's definitely bad engineering that could be triggered by literally any traffic.
That said, while obviously you want to avoid triggering the bug since it offlines your data source, this is definitely in the site's court to fix and could easily be triggered by normal usage. Some people browse with cookies disabled, especially since the EU passed its "cookie law", requiring sites to get consent before storing a cookie on visitors' machines. If you've started to notice more sites talking about cookies over the last year, that's why. 
Could also be something like storing Hibernate's second-level cache in the session. Unfortunately I've seen this: a significant chunk of the database was being copied into each user's session.
And as a webmaster, how can I tell the difference before it's too late?
Analyzing data that you're not allowed to access gives you/your company a competitive advantage, which is affecting their service/business even if it's not posted/distributed publicly.
Automated scraping is just a way to drastically reduce labor costs for information collection. Sure, it's a competitive advantage, but I think disallowing it or calling it unethical is a pretty big can of worms. Why is it ok if something is done by humans but not ok if a computer does it by itself?
This gives an unfair advantage to the tech-savvy "hackers". Facebook's terms protect against this. Thus scraping it is disallowed.
I couldn't say if it would be moral or immoral to do this. Personally, I'm more concerned about the well-being of the poor scraper program that has to scrape through an entire decade of Facebook posts. Poor thing.
These are the same websites and companies that are loading evercookies and doing browser fingerprinting, that break as much as possible the anonymity citizens should enjoy with Real Name policies, using network analysis to find who your friends are and what your politics and buying habits are, that routinely rip private information from your cell phone and share it with oppressive regimes.
You're not in Kansas anymore Toto.
If your administration doesn't have the resources (and it's often the case) to maintain a proper JSON API for you to fetch with a fancy Python lib, then it's not "super bad netizen stuff" to scrape a few HTML/PDF/XLS files, parse them, and display them for convenient public consumption on your personal website (while paying for the bandwidth).
It's 2016. State-companies holding a third party responsible for their own outages and poor planning is _bad faith_. ETL? Never heard of it?
: https://citymapper.com/i/1208/soutenez-citymapper-et-lopen-d... (french)
Yes, this defense is being petty about details, but I find it annoying when businesses use post-hoc discoverable limitations to limit people's rights.
Nonsense, there is no implication that this activity is illicit. Many sites (I have worked with hundreds) are happy to be included in my service, but don't have the technical ability to provide a data feed. They were delighted when I told them I could aggregate their content without any extra work on their part.
We respect TOS, we respect robots.txt and so on. Just because you study scraping techniques doesn't mean you intend to break the law.
> Breaking captchas and the like is basically blackhat work
Um, captchas only work if they work. If breaking them is trivial, they shouldn't exist. Don't shoot the messenger for pointing out the front door is unlocked.
Being amazed at this kind of bad behaviour where the targets are some of the most despicable companies on the web is a bit ironic. Scrape away, these companies hurt the web, let's hurt them (even though, all the scraping in the world won't have any impact).
How so? I send a web request, they send me the content in a response. If they aren't happy with that then they should refuse my request.
If I can modify my web browser to view a site, but skip the ads, that should be my right. If the site owner codes their site to detect this and then blocks my request to see their site, that should be their right. If I modify my ad-blocker to get around their ad-blocker-block, that should be my right, and so on. As long as we don't get into something like DDOS territory where a reasonable web site has no good technological way of avoiding the problem caused by a user, this isn't something for government to get involved in.
We need updated legislation that covers malicious actors that issue DDoS attacks but leaves normal people that scrape consciously and carefully alone.
Imagine a hotel that makes guests sign a document saying they will not make photographs of the building. If I'm not a guest, I can take photographs of it and I can't even know that would be illegal.
If you scrape, and effectively reconstitute a database, then so long as the database originally had a "substantial investment" in its "obtaining, verifying or presenting the contents", then yup... you have breached the database right, which is a modified form of copyright.
You may access said database (via the web), but as soon as you start reconstituting the database from scraping... you're in breach.
It's a law, it is illegal in the UK, I'm sure most countries have some equivalent law on their books, all of the EU does. The law looks recent, but UK copyright and patent used to cover it, the 1997 date is just a separate statute to clarify the position.
The WIPO Copyright Treaty, which the US also signed, requires in Article 5 that every member country have a database law of this kind.
> Compilations of data or other material, in any form, which by reason of the selection or arrangement of their contents constitute intellectual creations, are protected as such. This protection does not extend to the data or the material itself and is without prejudice to any copyright subsisting in the data or material contained in the compilation.
> United States of America
> Signature: April 12, 1997
> Ratification: September 14, 1999
> In Force: March 6, 2002
If Feist v. Rural occurred now and Rural, like most companies, kept their information in a database online, Feist would lose not for copyright infringement, but for exceeding authorized access to Rural's server.
This isn't even true metaphorically. It's like a shop front: there may be public access, but it is NOT public property.
Taking the store metaphor further, it would be more like you knocking on the front door of a clothing store and the store owners open the door and throw every possible piece of clothing at you, shirts, shorts, underwear, including coupons to "partner" stores, when all you wanted was a pair of pants.
Upon knocking, if the store owner hands you instructions on how to enter their store and interact with their products in a personalized shopping experience, that would be one thing. But when the clothing owner throws everything at you at once, what they flung at you is for all practical purposes public property.
This is called "clickwrap". There is usually a notice in the footer of each page that says something like "By using this site, you agree to our Terms of Service." Typically, this kind of notice has been held enforceable. More recently, judges have been demanding that such notices be placed more prominently before they're held enforceable (e.g., somewhere above the fold), but that's it.
>Imagine a hotel that makes guests sign a document saying they will not make photographs of the building. If I'm not a guest, I can take photographs of it and I can't even know that would be illegal.
The reasonable laws that exist in meatspace are not applicable online, because once you hit someone else's server, you're considered to be on their property and they have the right to control what you do there. There is no "public property" from which to safely stand and take photographs in the internet.
Also, photographs of structures may not be free to use. Architectural copyrights went into effect in the early 90s and can have a term of up to 95 or 120 years. Thus, if you take a photograph of a building built in 1991 and the year is not yet 2111, there is a chance that the architect can claim infringement.
1. Total privacy, they will not track me activity on their website, including any logs.
2. They will send me a cashier's check for $1,000 for each byte that they send to me.
3. They will provide me with Mana Sakura's cell phone number.
I'm still waiting for checks and a phone number.
It is ridiculous. Something like "pagewrap" can't trump the consumer protections that apply to a physical good like a book, it would be laughed off. But the law doesn't contemplate network access so reasonably.
The architect can claim infringement all they want, they don't have a case. From https://www.law.cornell.edu/uscode/text/17/120 :
The copyright in an architectural work that has been constructed does not include the right to prevent the making, distributing, or public display of pictures, paintings, photographs, or other pictorial representations of the work, if the building in which the work is embodied is located in or ordinarily visible from a public place.
This is an important caveat to architectural copyright, however, so thanks for clarifying.
See also https://mentalhealthcop.wordpress.com/2013/09/20/place-to-wh... which shows the same ambiguity exists in the UK.
IANAL but this seems perverse. In no meaningful sense am I on corporate property when my computer in my house sends signals to another computer, formatted so that they will be re-sent in turn to a series of other computers, the last of which decides on its own based entirely on the signal it receives from the penultimate host to send a "response" to a different series of other computers, the last of which is my computer in my house.
Surely there are better ways to enforce IP restrictions than this tortured analogy of networked computing to physical location?
Even if we entertain a distinction between browsewrap and clickwrap, browsewrap is generally enforceable, especially after minor modifications to placement and/or font size.
Even for sighted people, the notice is often easy to miss - and this is by design.
I don't think many websites have a secret ToS that they hope you won't read, I think most of them don't even know what their own ToS say. I signed my lease on a site with an explicit checkmark for ToS that said I agreed I would only use exactly IE7 to use their site.
I suppose I can do no better than quote from the Wikipedia page I linked:
> The Second Circuit then noted that an essential ingredient to contract formation is the mutual manifestation of assent. The court found that "a consumer's clicking on a download button does not communicate assent to contractual terms if the offer did not make clear to the consumer that clicking on the download button would signify assent to those terms."
The same page cites a number of cases where a "browsewrap" agreement was found unenforceable and only one where one was found enforceable - and the latter, for what it's worth, involved a sale taking place through the website rather than anything resembling passive browsing. Of course there exist other cases not listed; and there are situations that muddle the distinction between clickwrap and browsewrap. But at least, the very common pattern of, as you said, burying "a notice in the footer of each page" without anything vaguely resembling active consent, as practiced by probably the majority of commercial websites on the internet, seems to pretty clearly fall on the unenforceable side of the line based on those precedents.
The courts have generally disagreed with that interpretation.
No, it's not. It may be in public view, but that's a different issue.
This is a gross misunderstanding of how the internet works.
So I'm not so sure that police will escort you out of a Walmart because they caught you taking a picture of the parking lot with your smartphone.
If you're entering a country, do its laws not apply to you until you've seen a copy of them? "Oh, sorry, no one told me theft is illegal here. Where does it say that? Oh, I see. Okay. I'll stop now. Thanks for letting me know."
If you cross the border without necessary documents, does that country have no right to detain you, simply because you haven't checked the laws?
Just because a website is visible and public doesn't mean its content is public domain. It just means that your first order of business as a user should be to check the terms of service. Sure, most people using a website probably don't need to--same as not needing to check a country's stance on murder--and so can just use the website as intended without violating the terms. But when you plan on using it in a way that might not be intended, and you don't check the terms of service, well, that's on you.
Also, we don't even know how many laws there are in the United States, so I'd say knowing their content is impossible.
I don't need to check your terms of services if I'm doing something that I'm allowed to do by law anyway; the TOS cannot deny me those rights (they might, of course, grant me additional rights provided that I follow certain conditions).
The entire point of protocols is to precisely define the terms of communication. The status code is '200 OK', not '200 OK/Asterisk'. But of course, if lawyers didn't force themselves into the situation, they'd be out of jobs.
As an aside, I'd really like to see a browser plugin that would scrape sites in the normal course of access, storing the proceeds in a distributed public database.
This would be copyright infringement, since the content of the page is a substantive unique work that is automatically copyrighted by its author. A site that doesn't want you scraping its content is not going to want you posting dumps of its pages. Much like BitTorrent, they'd get into the protocol and send subpoenas to the ISPs behind the IPs that serve their pages, and use that info to sue the customer.
When my company was shut down by a legal threat related to scraping, I did suggest to my lawyer that we create something like a browser extension that would grab the data we needed out of normal client-side browsing sessions. This wouldn't be as nice as controlling the flow of information ourselves but it would've worked OK. My lawyer strongly suggested avoiding that as it could've been construed as conspiratorial conduct that would've made criminal prosecution more likely.
Perhaps this illustrates the fungibility of the legal system: it's an inherently human construct that pits a plaintiff against a defendant, and given a big enough warchest and persuasive-enough arguments, catastrophe can be avoided -- by Google; perhaps not by you, me, or someone else.
The main difference when Google was small was that Google was not dependent on any data source in particular, so even if someone denied their robot or sued them, they could cease and desist without affecting the overall value of their offering. This is different if you are getting data that is only available from one or two sources.
Now, the main difference is that Google is one of the biggest companies in the world, and they'll sic an army of $1,000/hr lawyers on you if you even think about taking legal action against them. The only people who can afford to fight are other big companies, but that's not going to happen because they all depend on breaking the CFAA for their own purposes and then using their position as a huge company to bully small innovators.
There are similar rulings for thumbnail images:
And of course books:
Google's primary out here is its reputation (not guarantee) for obeying robots.txt. If Google indexed a page that disallowed it in robots.txt, the case would be much stronger. There's also the unofficial out, which is that judges think Google is a cool large company, so they rule in their favor based on their personal biases.
Fair use is a case-by-case basis, so you can't say that Google's infringing conduct is generally accepted to be fair use. The EFF had to take on Universal in Lenz v. Universal Music Group, and that went up to the Supreme Court. That's how individuals are left to assert their fair use rights.
>Fair use is a case-by-case basis, so you can't say that Google's infringing conduct is generally accepted to be fair use.
There is so much wrong with this statement. For one, how can you call something infringing at the same time you point out that nothing has been proven? That simply defies all common logic.
Secondly, in general terms, the activities in question have been found to be non-infringing by the courts. Sure, fair use is case-by-case, but if you're operating within similar parameters as a previously litigated case, then the legal risk is immensely reduced.
I don't disagree with your assertion that the legal system greatly favours the well monied/connected (I don't think anyone would). But you can't claim it to be fact that Google Search is infringing anything with little to no evidence or rulings to cite. Unless you're just stating an opinion in which case you should clearly indicate that.
Fair use is an affirmative defense. Google admits that it copies content without legal license to do so, but claims that said copies are non-infringing under fair use exemptions. I guess you're probably correct that it's no longer appropriate to refer to Google's behavior specifically as "infringing", just "copying without authorization", which, for those of us without $5 million to commit to a legal team, means "infringing". I will try to remember the special standard of law which has been allowed to Google and refer to their copying only as "unauthorized" and not "infringing" in the future.
If you review the points summarized in the Wikipedia articles you helpfully linked, you'll see that Google's defense is mostly "Yeah, but we're Google".
In Field, "the court found that the plaintiff had granted Google an implied, nonexclusive license to display the work because of Field’s failure in using meta tags to prevent his site from being cached by Google.", i.e., because Field already knew Google existed and knew there was a standard way to prevent its access but chose not to employ it, he gave Google an implied license.
Who else does that work for? Can I send an email to Netflix and tell them "Hey, if you don't want me to copy your shows, please add this in your page's HEAD element: <meta name='please-dont-download-my-shows-sir'>"? No?
I understand there are other criteria which were used to decide if Google's use was specifically infringing in addition to the implied license. Just demonstrating that Google is getting favored treatment from the judiciary that would not be available to a normal entity.
In Perfect 10, the judge even explicitly indicated that he was loath to find Google's use of thumbnails infringing because he didn't want to "impede the advance of internet technology", but that he felt the law obligated him to do so (his ruling in that matter was overturned on appeal, when the Ninth Circuit found Google's usage non-infringing). What if the defendant had been some company perceived as less technically advanced than Google? This is probably as close as you can get to an explicit statement of favoritism. The Ninth Circuit also rejected Perfect 10's claim that RAM copies were infringing (which was not the case with an unlucky non-Google company discussed further down).
What if I started indexing and rehosting thumbnails? I can assure you that I would get C&D'd almost immediately and I would be forced to shut down because I can't afford to pay lawyers for 3 years while the case works through the system (and to be honest, I'm surprised it only took 3 years). And even if I could, with a reputation less sterling than Google's, there's no reason to believe that a judge would rule in the favor of one useless guy instead of a big company. A judge would look at the case and say "Google's use was fair because it provided a public service [actually cited as part of the justification in most of your linked cases], but this guy is just using it for a few hundred people, it's definitely unfair, he owes that company more money than he'll make in his life, case dismissed".
There are many such cases on the books. I don't know if Google has a direct connection to the reptilian overlords or what, but it seems in most cases where they're not involved, the good side loses.
In Craigslist v. 3Taps, while primarily a CFAA case, 3Taps was found to be infringing copyrights by sampling Craigslist postings in order to allow its clients to plot them on a map. Being a "public service" or a "referential use" didn't matter for them. They were raked over the coals, and it's been that way with most cases.
In Ticketmaster v. RMG Technologies, RMG was found to infringe just by parsing a page. "Defendant's direct liability for copyright infringement is based on the automatically-created copies of ticketmaster.com webpages that are stored on Defendant's computer each time Defendant accesses ticketmaster.com. [...] Defendant contends [...] that such copies could not give rise to copyright liability because their creation constitutes fair use[.] [...] Defendant's fair use defense fails."
Very similar findings were made in Facebook v. Power Ventures, and the founder was left holding a bag of $3 million in personal liability.
This is a thread about the legality of HN users scraping. It seems Google is the only entity capable of making unauthorized copies and then getting courts to agree that it's fair use. For the rest of us, it's infringement, which carries stiff penalties (and this doesn't even broach the CFAA portion of the issue).
So when I say "infringing", I mean something that would be considered infringing if you aren't Google. It's apparently only infringement if the judges involved don't personally use your site and don't have to worry about personally suffering the consequences of not having access to it. :)
>Can I send an email to Netflix and tell them "Hey, if you don't want me to copy your shows, please add this in your page's HEAD element: <meta name='please-dont-download-my-shows-sir'>"?
Actually, under fair use you certainly can make a personal copy (see Betamax case). If you distribute the work you would likely run afoul of the criteria summarized above.
The robots.txt relevancy is overstated in your argument. The main criteria used in this case are summarized above. The fact that Google provides an opt-out mechanism is a secondary, supporting argument.
>What if I started indexing and rehosting thumbnails? I can assure you that I would get C&D'd almost immediately
A determination of infringement would depend entirely on the context as related to the aforementioned criteria. The fact that someone might try to sue is a product of the terrible system in general, and you're absolutely right - as with any legal matter, the entity with the deeper pockets can often bully the other guy into submission.
>In Craigslist v. 3Taps, while primarily a CFAA case, 3Taps was found to be infringing copyrights
My understanding is that the copyright part of the case was thrown out  and thus was settled solely around CFAA matters.
>In Ticketmaster v. RMG Technologies , RMG was found to infringe just by parsing a page.
I agree that the logic used for the judgement is absurd (for reasons that are plainly obvious to any HN user). But it's less clear whether the case would meet the fair use criteria outlined above should it have come to that. My guess is that it wouldn't qualify, since the usage affects the copyright holder's ability to make money on the work and doesn't meet any of the other criteria for fair use.
>Facebook v. Power Ventures
This is not a case involving a defense of fair use (as far as I can tell). Facebook even acknowledged the users owned the data and had a right to it. The defendant was actually found to be violating CFAA and CAN-SPAM acts.
>It seems Google is the only entity capable of making unauthorized copies and then getting courts to agree that it's fair use. For the rest of us, it's infringement
Provably false. It sounds like perhaps your personal experience has soured your opinion on the matter? That's understandable. But none of the evidence you've cited supports the argument that Google is infringing copyrights in its core activities, nor that Google is the only entity where copyright laws and fair use legislation don't apply.
PS: To be clear, my argument revolves specifically around copyright infringement and fair use. I don't have enough understanding of other, separate legislation like CFAA to comment on that except to say that it seems overly broad and unrealistic. But that's another topic. I'm specifically arguing against calling Google a copyright infringer in a broad sense which is what you've done. That's not been proven.
Yes, I understand that the criteria for fair use is defined in the statute. What I'm saying is that like most things brought before judges, arguments can be made either way, and judges seemingly favor Google but not smaller defendants. Thus, while the RAM copies of web pages made by Google are fair use, those made by RMG aren't.
If you look at the Ninth Circuit's ruling in Perfect 10, the lengths they stretch to in reversing the District Court's finding that thumbnails were infringing are ridiculous. It's pretty clear that thumbnails are direct infringements and that you don't invalidate the copyright or create a truly "transformative use" by making an image smaller and adding it to an index. Perfect 10 was certainly of this opinion, and I'm sure they saw a real impact to their revenue.
Over the years I've learned that no position is too high to disregard the human factor. 99% of the time people are going to act primarily to their own benefit and work backwards to find rational (or rational-sounding) arguments to justify it. Judges are politicians and they're very image-conscious. None of them wants to be the one to make Google Image Search useless.
You seem to be saying that since Google's use was found non-infringing in these cases, its use is objectively non-infringing. I don't agree with this. Rather, I think that Google's conduct is a pretty plain violation of the relevant statute(s) and that most of it is not covered under fair use, the way the laws are currently written. I think that judges apply the statute in full force when smaller defendants present, but that they have a bias for Google (which is really a bias for themselves, since they know that serious backlash awaits the judge who puts the kibosh on it) that causes them to contort the law pretty heavily so that they can rule the way they want to.
>Actually, under fair use you certainly can make a personal copy (see Betamax case).
See, we were on the right track before we got into networks. Since then, the rulings have been pretty darn bad. The modern "Betamax case" may well have been American Broadcasting Cos. v. Aereo, Inc., and it wasn't a win for us.
Note also that separate from the copyright concern, the DMCA makes it illegal to circumvent a copy protection device (or indeed, even to teach another how to do so). Since Netflix employs DRM, even if there is a fair-use right to a copy of a Netflix program (which is by no means certain), you'd probably have to break the DMCA to obtain it.
>The robots.txt relevancy is overstated in your argument. The main criteria used in this case are summarized above. The fact that Google provides an opt-out mechanism is a secondary, supporting argument.
I disagree. Google has been able to discharge all CFAA claims because the judges have said "Well, you knew there was a way to stop it." If that's the logic, I'll happily inform the parties I may scrape that there's a way to stop it.
>A determination of infringement would depend entirely on the context as related to the aforementioned criteria.
Yes, I understand that the judge would write a report that appeared to consider the relevant criteria. The real question is, would that judge be willing to make the same logical contortions that other judges have made for Google?
I think that he would just go in favor of his biases, and right now we have a judiciary that is heavily biased against the little guy from the start, and this is only exacerbated by an inability to retain hotshot lawyers.
>My understanding is that the copyright part of the case was thrown out and thus was settled solely around CFAA matters.
The only portion of the copyright claim that was dismissed was Craigslist's claim that it owned an exclusive license in the scraped content. This was based on a short-lived ToU update that was specifically intended to strengthen Craigslist's case in this instance. The remaining copyright-related claims were allowed to stand, including a claim that Padmapper had violated a copyright Craigslist holds on the collection of advertisements (rather than on the advertisements themselves).
>[re: RMG] I agree that the logic used for the judgement is absurd (for reasons that are plainly obvious to any HN user).
If you agree the logic was absurd, you agree that a copy of the page that exists in RAM for microseconds does not qualify as a protected copy any more than the reflection of an image on one's retina qualifies. As a "copy" that should be ineligible for copy protection, it doesn't matter if it qualifies for fair use (and I don't necessarily agree that it wouldn't).
> [re: Facebook v. Power] This is not a case involving a defense of fair use (as far as I can tell).
Correct. I was including it because it's an example of Google getting another free pass for stuff that shuts others down, which is the CFAA. CFAA claims are raised against Google in at least Field and Perfect 10, and they get dismissed based on the judge's assumption that the plaintiff knows about the special steps Google makes you take to stop them from violating the CFAA, the absurdity of which we've already discussed.
My wording that the "findings were very similar" was definitely bad since a different law was in play. I meant they were very similar in nature, not in fact. That said, it's likely the only reason that the cached pages weren't considered infringement is that Facebook didn't bring it up.
>But none of the evidence you've cited supports the argument that Google is infringing copyrights in its core activities nor that Google is the only entity where copyright laws and fair use legislation don't apply.
Again, I'm discussing this from a practical position, not one that is strictly compliant with legal theory, where judges always enforce the law with perfect equity, and in which anything a judge (or jury) finds becomes Official Truth de-facto.
From a textbook perspective, sure, everyone has all the same rights and the legal system is always applied equitably. I simply don't believe that has borne out in practice when it comes to internet-centric companies that aren't household names.
It seems that the things Google does are considered infringement when other people do them. Thus, it behooves us to know the actual law and follow it, even if Google gets a free pass, since we can't rely on the judiciary to interpret the law favorably for us.
RMG is a great example because it occurred after Perfect 10, and the same argument against RAM copies was raised in both cases. It's apparently fair use if Google scrapes your page to download and rehost all of your images, but it's not fair use to read out non-copyrightable factual data unobtainable from any other source (like ticket prices and event times) and rehost it nowhere. Sure.
The alternate lesson here is to focus on getting really big and powerful really quickly, and making sure you cultivate a good public image, so that judges are afraid to rule against you in ways that would affect a product offering upon which millions of people depend. That seems to have worked for most big internet companies, actually. Definitely worked for Facebook and Google.
 http://www.dmlp.org/sites/dmlp.org/files/2013-04-30-Order%20... pgs. 9-16
They don't even need to do that. They just cheerfully agree to not scrape you, and wait for you to come back and beg to be re-instated when your search traffic plummets.
I think imbuing technical protocols with legal implications would be even worse than the current situation since then changing anything on a protocol would require changing the law and getting a protocol implementation slightly wrong would carry real-world legal repercussions on the order of licensing your work in the public domain rather than retaining copyright. Let the lawyers make the law and check the human terms of service before using the data. Trying to out-lawyer the lawyers is like challenging a hedgehog to a butt-kicking brawl.
"The User-Agent request-header field contains information about the user agent originating the request. This is for [...] the tracing of protocol violations [...]. User agents SHOULD include this field with requests"
Many scrapers disregard this part of the protocol. Of course, whether a headless browser should send a different UA is an interesting question.
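Complying with that SHOULD costs almost nothing; a sketch in Python (the bot name, URL, and contact address are invented):

    import requests

    HEADERS = {
        "User-Agent": "example-scraper/1.0 (+https://example.org/bot; admin@example.org)"
    }

    response = requests.get("https://example.com/page", headers=HEADERS)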
The CFAA is a really bad law and creates the network effect lock-in that we all considered a natural part of the web. It doesn't have to be that way -- users should be free to use any browsing appliance they want, including so-called "scrapers".
Big companies like Google not only got their start by flagrantly violating the CFAA, copyright, and privacy laws, but they continue to do so. The moral of the story is hurry up and get big before you get sued or arrested.
There's a long history of ridiculous web scraping rulings based on technical misunderstandings by neophyte judges, including Ticketmaster v. RMG, where infringement was found because the company scraped data out of a page with the Ticketmaster logo on it.
Facebook sued a company called Power Ventures which read out only the user's own data. The founder was found personally liable for $3 million in damages. Facebook did this because they don't want it to be easy for their users to move between social media services. If it's easy, Facebook has to compete on merit instead of just keeping switching costs high. Facebook doesn't like that, so they sue people who make it possible -- and the law says they should win.
We badly need a revised law, but the powers-that-be will strongly oppose it because it would threaten their monopoly over web properties. They continue to flaunt their strategic ignorance of these laws and then take shelter behind them to stop risk from small innovators (i.e., having to compete fair and square).
In the real world, we have a lot of laws that mostly prevent this kind of bad behavior. In cyberspace, the structure is such that most of those laws are not applicable. We need to update and port the pro-small-business logic we have for meatspace companies so that it counts online too. The state of affairs online is really bad.
I want to get a law called the "Consumer Data Freedom Act" passed, which would allow users to access any web property with any non-disruptive browsing device, including custom scrapers that don't impose much more load than a typical user browsing session would.
Judge: George Jung, you stand accused of possession of six hundred and sixty pounds of marijuana with intent to distribute. How do you plead?
George: Your honor, I'd like to say a few words to the court if I may.
Judge: Well, you're gonna have to stop slouching and stand up to address this court, sir.
George: [stands] Alright. Well, in all honesty, I don't feel that what I've done is a crime. And I think it's illogical and irresponsible for you to sentence me to prison. Because, when you think about it, what did I really do? I crossed an imaginary line with a bunch of plants. I mean, you say I'm an outlaw, you say I'm a thief, but where's the Christmas dinner for the people on relief? Huh? You say you're looking for someone who's never weak but always strong, to gather flowers constantly whether you are right or wrong, someone to open each and every door, but it ain't me, babe, huh? No, no, no, it ain't me, babe. It ain't me you're looking for, babe. You follow?
Judge: Yeah... Gosh, you know, your concepts are really interesting, Mister Jung.
George: Thank you.
Judge: Unfortunately for you, the line you crossed was real and the plants you brought with you were illegal, so your bail is twenty thousand dollars.
I’d assume a lot of HN users are from such locales.
We don’t always have to assume US laws apply globally – they don’t.
anti-scraping: If somebody were to offer a telephone book database online and you created a copy of that to sell on your own, you'd almost certainly lose in the EU (since unlike in the US, databases as pure collections of facts have their own copyright protections)
The legally safest locations probably are outside the western world if you are targeting western sites.
Every case I've seen wrt Ryanair (they sue a lot of people) has resulted in a win for Ryanair. Do you have details on the case you're describing?
Scraping purely factual data is one of my points of defense in the US. I don't want to give it away.
>It's still risky though, the safest locations probably are outside the western world if you are targeting western sites.
Yeah, this was ultimately the conclusion I had to come to. However, outside the West, the Western companies will just send someone with a briefcase full of $100 bills and pay them off. Corrupt government officials in these locations want the goodwill of a big American company a lot more than they care about any particular random guy.
There is only one workable solution: run the service totally anonymously and maintain good opsec so that your cover isn't blown. All under the table. This has its own issues, like making it difficult to receive payment and putting one at much greater legal risk than a mere CFAA dispute, but it's the only option if you don't plan to get shut down.
I edited my original comment to reflect that.
Also, since this is somewhat untouched territory, don't be so sure that you'll get a judge who is as well-versed in web scraping and infrastructure as you, or shares your opinions on the subject. (And given that precedents are so important in US law, you'd better hope someone else before you didn't get such a judge.)
If you're not going to run it totally anonymously, you should be prepared to jettison and repackage it when you get found out (so that you appear to be complying with the C&D).
Scraping is a huge part of the web, and everyone does it. It sucks that it has to live underground because only big companies can duke it out in court.
I played with the idea of creating some social aggregation type service with some friends (as a business). After reading about FB's past behavior with regard to this, and how essential they are to any sort of service like that, I canned the project. Regardless of what their TOS say, if you get on their radar and they send you a cease-and-desist, it's game over. Facebook is not in the business of subverting their revenue stream, so if you are making money off them and it's preventing them from capitalizing on their users, don't expect to last long if you exist by their grace.
Really, there's an interesting space between so small nobody cares and large enough that getting shut down is a real problem. A lot of projects start small and end up (relatively) large, but without a good way to pay for the service itself. While not every service needs to be a business and make money, once you reach the level where you risk either being shut out of your data source or needing to work out an understanding with that source, how do you approach that when being able to pay is off the table? Not to mention the timing problem: approach them before you have to and you force the situation; wait too long and you risk the wrath of the source because you've abused their service for as long as you have. Has anyone else been in this situation and found an approach that works?
I understand the use of ToS clauses to prevent scraping but I do kind of wonder to what extent they have authority here.
IANAL, but surely this would fall under copyright law? While re-publishing copyright-protected data without consent is probably unlawful in your region (like scraping an art site and re-posting the images), I wouldn't think just scraping data points for a different purpose (like scraping amazon for the purposes of price comparison) is nearly so clear cut (or enforceable), but maybe I'm just naive.
Companies like PriceZombie are forced to stop because the CFAA says that Amazon can prevent them from accessing their servers by decree alone. A ToS isn't even really necessary for this, but it helps them pin down their argument.
PriceZombie could try to get the data from third-party caches, but it only solves part of the problem, because copyright and trademarks come back into the picture once you have a replica of the target page. In Ticketmaster v. RMG Technologies, the judge found RMG infringing on Ticketmaster's trademarks and copyrights because the page they were scraping included Ticketmaster's logo. The judge said the copy of the full page that existed momentarily in RAM while the scraper extracted the non-copyrightable data constituted a copy that infringed on Ticketmaster's rights, even though the logo was never used by the application in any way, it just happened to be on the page.
If you were referring to a different decision I'd love to read it. I follow this stuff (and at one time explored what legal action our startup could take against scrapers). In our case we also offered a paid API so it was fairly easy to establish damages.
"The panel held that the defendant, a former employee
whose computer access credentials were revoked, acted
“without authorization” in violation of the CFAA when he or his former employee co-conspirators used the login
credentials of a current employee to gain access to computer data owned by the former employer and to circumvent the revocation of access. "
I think that case is unambiguous - this guy was using someone else's credentials to access secured systems after having been explicitly told that he could not. I was referring to the MySpace case.
I don't think these two cases are in conflict; IMO they are very different. Additionally, for our purposes in this comment thread, we're talking about scraping of publicly available websites by outside parties, not by former employees whose access has been explicitly revoked. That is different than either of these cases.
The CFAA says it's a crime to exceed "authorized access". Authorized access is whatever the server's owner says it is. If they change their mind, you must cease and desist or risk both civil and criminal penalties. A contract defining the length and nature of your authorization from the server's owner would go a long way to establishing your rights to access, but no one is going to give that to a small player.
Another issue is that on the internet, jurisdiction is a very messy affair. An American judge will likely determine that California and/or the federal government has jurisdiction over such a case because Google is based in California. Most developed countries have treaties with one another that allow them to enforce foreign civil judgments on behalf of the jurisdiction that entered them. Most developed countries also have mutual extradition treaties. The countries that don't can easily be paid off by an interested party.
For me it's purely for personal use and my little side projects. I don't even like the word scraping because it comes loaded with so many negative connotations (which sparked this whole comment thread) - and for good reason - it's reflective of the demand in the market. People want cheap leads to spam, and that's a bad use of technology.
Generally I tend to focus more on words and phrases like 'automation' and 'scripting a bot'. I'm just automating my life; I'm writing a bot to replace what I would otherwise have to do on a daily basis - like looking on Facebook for some gifs and videos, then manually posting them to my site. Would I spend an hour each and every day doing this? No, I'm much lazier than that.
Who is anyone to tell me what I can and can't automate in my life?
You can try telling the judge it's baloney, but if he's going by current precedent, he probably won't agree with you.
You are exactly right. But although a site can deny you access for any arbitrary reason (it's their website, after all), obviously governments think they are the ones to enforce this crap.
What if the ToS says you can only access the site while jumping through hoops? You only read the ToS after a while and weren't jumping? Well, too bad, now you are being sued for reading the main page _and_ the ToS page without jumping around.
This comment Terms of Service: If you read any of this text you owe lerpa $1.000.000 to be paid up until 09/01/2016.
They consider their data to be theirs, even though they published it on the internet.
They consider your data (your personal integrity) to be theirs as well, because how can you assume personal integrity when you are surfing the internet?
I have high hopes that the judicial system, some time not too far from now, will realize that since the law should be a reflection of current moral standings, it will always be behind, trying to catch up with us, and that those who break the law while not breaking current moral standings are still "good citizens" unworthy of prison or fines.
I guess Google won this iteration of the internet because of the double standards site owners apply: allowing Google to scrape anything while hindering any competitors from doing the same. There will only be a true competitor to Google when, in the next iteration of the internet, we realize that searching vast amounts of data (the internet) is a solved problem, that anyone can do as good a job as Google, and move on to the next quirk, around which there will be competition. In the end that quirk will be solved too, we'll have a winner, and it will be time to move on to the next iteration.
Call me cynical if you will, but I'd leave "while abiding the law" out of that, or at least replace it with "while hoping they aren't breaking the law". Due diligence on these matters is often sadly lacking. They'll take the information first and only consider any such implications when/if they come up later.
Large organisations like Google probably will make the up-front effort to remain legal, because they are in the public eye enough for lack of doing so to attract a lot of unwanted press, but you don't have to get a lot smaller than that to start finding companies who are a lot less careful (or in some cases wilfully negligent).
For instance the browser choice script that came with Windows imposed by the EU never worked. It was a "bug". Somehow they must have omitted to test the feature...
Up until last year Microsoft had been playing nice, and I think Google and Facebook had become the new corporate villains. But recently the Windows team seems minded to challenge them for that position.
I might have accepted terms when I created a Google Account but in no way do I agree to a TOS by visiting a URL.
If that doesn't hold up in court, in future on your first visit to Google it will simply display some text and require that you click 'I agree' to continue.
Either way, it seems reasonable to me that you should agree to their terms in order to use their service.
I'm honestly wondering about the double standard. There is a rational way to discuss morality/ethics and subsequent laws regarding most technical aspects, that often mirrors real world (read: offline/analog) scenarios. It's unfortunate that the legal system has instead been appropriated by lawyers.
It's unfortunate that the internet has instead been appropriated by hackers.
It's unfortunate that the stock market has instead been appropriated by traders.
It's unfortunate that the asylum has instead been appropriated by inmates.
Very few traders went to jail after 2008. Seemingly legal (or at least not illegal). Should they have? Most bright/talented lawyers are likely working (again, within the law) to get megacorps or rich people off for things poorer people would not get away with. In our field, the OP's situation is one of these issues. What information is free and what information is not? What things I'm allowed to do offline am I allowed to do online?
I'm not proposing a solution, but any system populated by humans will be abused by some, and fought for by some idealists, all within that system's rules.
Let's take murder:
I stab someone: murder.
I use a broom to push a flower pot off a balcony hitting someone in the head, killing them: murder.
I swat a butterfly in Beijing, setting off a chain of events that ends with a container crushing a dock worker in Rotterdam. Murder? If this extreme example comes down to intent, it's thought crime; otherwise I'm playing within the rules of the system, and I just killed someone, scot-free.
While there apparently were no laws prohibiting the upselling of bad mortgages, and banks had the resources to move the market towards more and worse mortgages, that also was within the system's rules. But I personally think it's far beyond the intended use of that market, and well outside the spirit of the laws.
There's a huge difference between judicial justice and what most would agree was "justice". That's where my first comment came in. True about most systems.
I'm not sure why you dislike this 'appropriated by lawyers' outcome: For web crawling look at robots.txt, for other uses look at the Terms link on the homepage. If you don't agree to the terms then stop accessing the website. Seems straightforward and fair to me.
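Honoring robots.txt is also nearly free on the crawler side. A minimal sketch using Python's standard library (the site and user agent string below are placeholders, not anything from the thread):

    from urllib.robotparser import RobotFileParser

    # Placeholder values; swap in the real site and your own bot's UA string.
    SITE = "https://example.com"
    USER_AGENT = "my-polite-crawler"

    rp = RobotFileParser()
    rp.set_url(SITE + "/robots.txt")
    rp.read()  # fetch and parse the site's robots.txt

    url = SITE + "/some/page"
    if rp.can_fetch(USER_AGENT, url):
        print("robots.txt allows fetching", url)
    else:
        print("robots.txt disallows fetching", url)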
Scraping and crawling are the same thing, btw. I absolutely love how the English language has several words for the same thing. Your language is very expressive.
Google is a scraper. Your data will end up in their index. You are perfectly OK with Google "stealing" your data.
A new player crawling your site is an offence to you. How dare someone other than Google or Bing put pressure on my site? How dare they steal my data?
TOS is a joke.
I wonder, what was the intention of the founding fathers of the internet? Was it not to make data publicly available?
This statement is demonstrably false, as shown by all the places in the world where this type of TOS-nonsense actually does not hold up in court.
And in the USA, it's (as usual) even slightly more absurd: The only reason it does hold up in court is because Google can afford justice.
If Google's actions were illegal, I'm sure that they would have been sued, even if their scraping and indexing usually is helpful for the website owner.
I suspect I'm one of those bad people your parents tell you to avoid - by that I mean I completely ignore robots.txt.
At this point, my architecture has settled on a distributed RPC system with a rotating swarm of clients. I use RabbitMQ for message-passing middleware, SaltStack for automated VM provisioning, and Python everywhere for everything else. Using some randomization and a list of the top n user agents, I can generate roughly 800K unique but valid-looking UAs. Selenium+PhantomJS gets you through non-captcha Cloudflare. Backing storage is Postgres.
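For illustration only (not my actual code): the UA generation boils down to taking a handful of real templates and randomizing the version fields. A minimal sketch, with made-up templates and version ranges:

    import random

    # Hypothetical templates; a real list would be built from the top-n
    # observed user agent strings, with the version fields parameterized.
    UA_TEMPLATES = [
        "Mozilla/5.0 (Windows NT {nt}.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/{major}.0.{build}.{patch} Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64; rv:{major}.0) Gecko/20100101 Firefox/{major}.0",
    ]

    def random_ua():
        # str.format ignores unused keyword arguments, so one pool of
        # randomized fields works for every template.
        return random.choice(UA_TEMPLATES).format(
            nt=random.choice([6, 10]),
            major=random.randint(45, 60),
            build=random.randint(2000, 4000),
            patch=random.randint(0, 200),
        )

    if __name__ == "__main__":
        for _ in range(5):
            print(random_ua())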
Database triggers do row versioning, and I wind up with what is basically a mini internet-archive of my own, with periodic snapshots of a site over time. Additionally, I have a readability-like processing layer that re-writes the page content in hopes of making the resulting layout actually pleasant to read on, with pluggable rulesets that determine page element decomposition.
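Roughly, the row versioning works like the sketch below: a trigger copies the old row into a history table on every UPDATE. The table and column names here are invented for illustration, not my actual schema.

    import psycopg2

    # Invented schema: a `pages` table plus a `pages_history` table that the
    # trigger appends to before any row in `pages` is overwritten.
    DDL = """
    CREATE TABLE IF NOT EXISTS pages (
        id      bigint PRIMARY KEY,
        url     text,
        content text
    );

    CREATE TABLE IF NOT EXISTS pages_history (
        id          bigint,
        url         text,
        content     text,
        archived_at timestamptz DEFAULT now()
    );

    CREATE OR REPLACE FUNCTION archive_page_row() RETURNS trigger AS $$
    BEGIN
        INSERT INTO pages_history (id, url, content)
        VALUES (OLD.id, OLD.url, OLD.content);
        RETURN NEW;
    END;
    $$ LANGUAGE plpgsql;

    DROP TRIGGER IF EXISTS pages_versioning ON pages;
    CREATE TRIGGER pages_versioning
        BEFORE UPDATE ON pages
        FOR EACH ROW EXECUTE PROCEDURE archive_page_row();
    """

    with psycopg2.connect("dbname=scraper") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)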
At this point, I have a system that is, as far as I can tell, definitionally a botnet. The only thing is, I actually pay for the hosts.
Scaling something like this up to high volume is a really interesting challenge. My hosts are physically distributed, and just maintaining the RabbitMQ socket links is hard. I've actually had to do some hacking on the RabbitMQ library to let it handle the various ways I've seen a socket get wedged, and I still have some reliability issues in the SaltStack-DigitalOcean interface where VM creation gets stuck in an infinite loop, leading to me bleeding all my hosts. I also had to implement my own message fragmentation on top of RabbitMQ, because literally no AMQP library I found could reliably handle large (>100 KB) messages without eventually wedging.
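The fragmentation itself is conceptually simple; the pain is all in the transport. A stripped-down sketch of the idea, independent of any particular AMQP library, with an arbitrary chunk size:

    import base64
    import json
    import uuid

    CHUNK_SIZE = 64 * 1024  # arbitrary; keep each frame well below the problem size

    def fragment(payload: bytes):
        """Split a large payload into self-describing JSON frames."""
        msg_id = str(uuid.uuid4())
        chunks = [payload[i:i + CHUNK_SIZE]
                  for i in range(0, len(payload), CHUNK_SIZE)] or [b""]
        for seq, chunk in enumerate(chunks):
            yield json.dumps({
                "id": msg_id,
                "seq": seq,
                "total": len(chunks),
                "data": base64.b64encode(chunk).decode("ascii"),
            })

    def reassemble(buffers: dict, frame: str):
        """Accumulate frames per message id; return the payload once complete."""
        meta = json.loads(frame)
        parts = buffers.setdefault(meta["id"], {})
        parts[meta["seq"]] = meta["data"]
        if len(parts) == meta["total"]:
            del buffers[meta["id"]]
            return b"".join(base64.b64decode(parts[i]) for i in range(meta["total"]))
        return None  # still waiting on more frames

    # Usage: publish each frame as its own AMQP message; on the consumer side
    # feed incoming frames into reassemble() until it returns a payload.
    if __name__ == "__main__":
        buffers = {}
        for frame in fragment(b"x" * 300_000):
            blob = reassemble(buffers, frame)
        assert blob == b"x" * 300_000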
There are other fun problems too, like the fact that I have a Postgres database that's ~700 GB in size, which means you have to spend time considering your DB design and doing query optimization too. I apparently have big data problems in my bedroom (my home servers are in my bedroom closet).
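Most of the query-side wins come from the boring stuff: reading EXPLAIN ANALYZE output and indexing the columns you actually filter on. A toy example against a hypothetical `pages (url, content)` table, not my real schema:

    import psycopg2

    with psycopg2.connect("dbname=scraper") as conn:
        with conn.cursor() as cur:
            # See what the planner actually does for a hot query.
            cur.execute(
                "EXPLAIN ANALYZE SELECT content FROM pages WHERE url = %s",
                ("http://example.com/",),
            )
            for (line,) in cur.fetchall():
                print(line)

            # A sequential scan over a ~700 GB table is the usual culprit;
            # an index on the filter column turns it into an index scan.
            cur.execute("CREATE INDEX IF NOT EXISTS pages_url_idx ON pages (url)")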
It's all on github, FWIW:
Agent and salt scheduler: https://github.com/fake-name/AutoTriever
Or... well, 4 separate projects. Whoops?
At one point, a friend and I were looking at trying to basically replicate the google deep-dream neural net thing, only with a training set of porn. It turns out getting a well tagged dataset for training is somewhat challenging.
Well-tagged hentai is trivially accessible, though. I think there's probably a paper or two in there about the demographics of the two fan groups. People are fascinating.
Next up, automate the consumption too!
And what is served through their website is resized. So web-scraping is an inferior approach.
1. I'm scraping the resized galleries.
2. I don't have the Hath perk that makes the galleries full sized.
3. I don't have a phash-based fuzzy image deduplication system on top of all this (see https://github.com/fake-name/IntraArchiveDeduplicator). Its main purpose is to deduplicate manga (https://github.com/fake-name/MangaCMS).
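For reference, the core of a phash-style deduplicator is small. A rough sketch using the `imagehash` package (not the actual IntraArchiveDeduplicator code, and the distance threshold is a guess):

    from PIL import Image
    import imagehash

    MAX_DISTANCE = 4  # hamming distance at or below which images count as near-duplicates

    def find_near_duplicates(paths):
        """Return pairs of image paths whose perceptual hashes are close."""
        hashes = [(path, imagehash.phash(Image.open(path))) for path in paths]
        dupes = []
        for i, (path_a, hash_a) in enumerate(hashes):
            for path_b, hash_b in hashes[i + 1:]:
                # Subtracting two ImageHash objects gives the hamming distance.
                if hash_a - hash_b <= MAX_DISTANCE:
                    dupes.append((path_a, path_b))
        return dupes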
I have huge, uh, "datasets" around still, though.
I'm not scraping high-value sites like that (I mostly target amateur original content). It's not really of interest to other businesses. As such, I tend to just run into things like normal Cloudflare-wrapped sites, and one place that tried to detect bots and return intentionally garbled data.
If I run into that sort of thing, I guess we'll see.
But if the end justifies the means... http://luminati.io/
As it is, I think I'm OK, since it's basically just a "website DVR" type thing, for my own use.
Really, if nothing else, the project has been enormously educational for me. I've learnt a boatload about distributed systems, learned a bit of SQL, dicked about with databases a bunch, and actually experienced deploying a complex multi-component application across multiple disparate data centers.
I see you don't have a license listed on GitHub. Do you have a license in mind for these?
This isn't quite as fancy as readability, though I integrated a port of readability for a while. Now I just write a ruleset for a site that has stuff that interests me.
Regarding costs, I really have no idea. It depends on how rapidly you cycle the UA, and how fast whatever you're scraping is.