How Web Scraping Is Revealing Lobbying and Corruption in Peru (scrapinghub.com)
398 points by bezzi on Mar 9, 2016 | 72 comments



I'm from Ukraine, and our biggest success in battling corruption has come from a system called ProZorro [1] ("transparently") for government tenders.

It started as a volunteer project, and some projections put savings at around 10% of the total budget once it becomes mandatory in April.

[1] https://github.com/openprocurement/


Hi there, I am the author of the blog post. I'll be happy to answer any questions.


This is great work. Forgive me if I'm missing it, but since the blog post implies you're aggregating and cleaning the data from several lists, is there any way to see the latest additions (RSS etc?) rather than directly searching for individuals?

It would make it more useful for flagging up potential stories, as well as researching stories journalists are already writing.

disclosure: I work for a company that provides real-time data to journalists for story discovery, and I know we'd certainly be interested


I never thought of that, but an RSS feed is certainly a great idea. I haven't done it because the journalists haven't requested it; so far they have been asking me for more spiders, so that Manolo would include visit records from other Peruvian institutions.


If you do choose to implement it, please do let me know (email in profile), I'll definitely make sure we use it.


Should be a fairly easy thing to do; this is Python and Scrapy!

http://stackoverflow.com/questions/28127396/creating-rss-wit...
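Since Manolo is Django-based, one option is to skip Scrapy entirely and generate the feed from the stored records. A minimal standard-library sketch of that idea (the record fields `name`, `institution`, and `date` are assumptions about what Manolo stores, not its real schema):

```python
# Hypothetical sketch: build an RSS 2.0 feed from scraped visit records
# using only the standard library. Field names are invented for illustration.
import xml.etree.ElementTree as ET

def visits_to_rss(records, feed_title="Manolo: latest visit records"):
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = feed_title
    for rec in records:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = (
            f"{rec['name']} visited {rec['institution']}"
        )
        ET.SubElement(item, "pubDate").text = rec["date"]
    return ET.tostring(rss, encoding="unicode")

feed = visits_to_rss([
    {"name": "J. Perez", "institution": "Ministry of Energy",
     "date": "Mon, 07 Mar 2016 10:00:00 GMT"},
])
```

In a real Django app, the `django.contrib.syndication` framework would do the same job with less code.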




nsoldiac:

Carlos, really great work, congratulations!! I've been studying topics at the intersection of technology and corruption for a while here at Berkeley. I have interesting testimony from contacts who have lived through the post-technology shift in government. Peru has a lot of potential in this area. If you ever need help, I'd be happy to support you!

carlosp420:

Thank you! In Peru there are already several groups of journalists who have partnered with developers on interesting data-journalism projects: Ojo Publico [1], Convoca [2] and IDL Reporteros [3]. But even so, we can't keep up; there's so much to do!

[1] http://ojo-publico.com/ [2] http://www.convoca.pe/ [3] https://idl-reporteros.pe/


Likewise. I also live abroad (in Canada), but I'm fully willing to support any initiative of this kind.


Good work! It would be interesting to cross-match the visits with other sources of information (newspapers, WikiLeaks, etc.) over a timeline to reconstruct someone's whole history. This would make it possible to identify patterns and their modus operandi.


It would be interesting to see the volume of visits by government office year over year. I have a feeling that periods around elections might look very different. Also would be interested to see distribution color-coded by industry. Mining and contracting should pop up for certain time periods and government agencies.


Yes. So far we have a very simple API: http://manolo.rocks/docs/ With this API, it is possible to download all the structured data kept in Manolo and do such interesting analyses.

Or maybe that could be implemented in Manolo's GUI. It shouldn't be difficult, as it is based on Django.
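Once the records are downloaded through the API, the year-over-year breakdown the parent comment suggests is a small aggregation step. A sketch, assuming ISO-dated records with an `institution` field (the real API's field names may differ):

```python
# Count visits keyed by (institution, year) from downloaded records.
# The record layout here is an assumption for illustration.
from collections import Counter

def visits_per_office_year(records):
    """Return a Counter mapping (institution, year) -> number of visits."""
    return Counter(
        (rec["institution"], rec["date"][:4])  # assumes ISO dates: YYYY-MM-DD
        for rec in records
    )

sample = [
    {"institution": "Ministry of Energy", "date": "2015-06-01"},
    {"institution": "Ministry of Energy", "date": "2016-02-11"},
    {"institution": "Ministry of Energy", "date": "2016-03-01"},
    {"institution": "Congress", "date": "2016-03-02"},
]
counts = visits_per_office_year(sample)
```

Plotting these counts over time, or color-coding by a visitor-to-industry mapping, would surface the election-period patterns the parent is curious about.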


Is there any source of information like Peru's, but for Bolivia? I imagine there is a lot to discover about corruption and influence peddling in Bolivia.


Very interesting how tools like these can be so helpful to journalists and, more generally, to transparency in government functions.

Probably world-changing, considering that even semi-technical folks can cook up tools to dig into things like this.

I know this tool was built by a developer, but Scrapinghub also has a web UI for making scrapers.


Full disclosure, I work for Scrapinghub and the web UI you speak of is Portia - our open source visual web scraper. It's for those who range from non-technical to technical but want a quick way to scrape data. I think it's extremely important to develop tools to democratize the acquisition of data regardless of technical background and skill. Glad you find the article and tool interesting!


Yes, totally agree with you on the great potential of tools for easy data acquisition.

I have personally used Scrapy in the past, I find it to be a great tool.

Congratulations on your work!


Thank you. Glad you enjoy Scrapy, we're pretty fond of it ourselves!


A similar thing happened in Costa Rica:

    “You can’t visit 160,000 people,” she notes. “But 
    you can easily interrogate 160,000 records.”
http://foreignpolicy.com/2015/05/27/the-data-sleuths-of-san-...


Can you draw a co-visit graph of people, i.e. who visited the building at the same times as somebody else? The strength of a connection could be visited_both^2 / ((visited_without_other_1 + 1) * (visited_without_other_2 + 1)).
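That metric is straightforward to compute from the visit logs. A sketch, modelling visits as (person, timeslot) pairs; the data layout is hypothetical:

```python
# Compute the proposed co-visit strength for every pair of visitors:
# strength = both^2 / ((alone_a + 1) * (alone_b + 1))
from collections import defaultdict
from itertools import combinations

def covisit_strengths(visits):
    """visits: iterable of (person, timeslot). Returns {(a, b): strength}."""
    slots = defaultdict(set)       # timeslot -> set of people present
    totals = defaultdict(int)      # total visits per person
    for person, slot in visits:
        slots[slot].add(person)
        totals[person] += 1
    both = defaultdict(int)        # co-visits per (sorted) pair
    for people in slots.values():
        for a, b in combinations(sorted(people), 2):
            both[(a, b)] += 1
    strengths = {}
    for (a, b), n in both.items():
        alone_a = totals[a] - n    # visits by a without b
        alone_b = totals[b] - n    # visits by b without a
        strengths[(a, b)] = n ** 2 / ((alone_a + 1) * (alone_b + 1))
    return strengths

visits = [("A", "t1"), ("B", "t1"), ("A", "t2"),
          ("B", "t2"), ("A", "t3"), ("C", "t3")]
strengths = covisit_strengths(visits)
```

The resulting edge weights could feed directly into a graph library for visualization.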


In other countries, corrupt politicians have found out that a simple captcha every n items is enough to defeat this kind of analysis.


https://anti-captcha.com/ & https://rucaptcha.com/ - I think that can be best summarised as "from Russia with love" :)


FWIW, if you live in the U.S., then you benefit from having such data in great quantity, though I don't think it's sliced and diced to anywhere near its potential:

Lobbyists have to follow registration procedures, and their official interactions and contributions are posted to an official database that can be downloaded as bulk XML:

http://www.senate.gov/legislative/lobbyingdisc.htm#lobbyingd...

Could they lie? Sure, but in the basic analysis that I've done, they generally don't feel the need to...or rather, things that I would have thought lobbyists/causes would hide, they don't. Perhaps the consequences of getting caught (e.g. in an investigation that discovers a coverup) far outweigh the annoyance of filing the proper paperwork...having it recorded in an XML database that few people take the time to parse is probably enough obscurity for most situations.

There's also the White House visitor database, which does have some outright admissions, but still contains valuable information if you know how to filter the columns:

https://www.whitehouse.gov/briefing-room/disclosures/visitor...

But it's also a case (as it is with most data) where having some political knowledge is almost as important as being good at data-wrangling. For example, it's trivial to discover that Rahm Emanuel had few visitors despite his key role, so you'd have to be able to notice that and then take the extra step to find out his workaround:

http://www.nytimes.com/2010/06/25/us/politics/25caribou.html

And then there are the many bespoke systems and logs you can find if you do a little research. The FDA, for example, has a calendar of FDA officials' contacts with outside people...again, it might not contain everything but it's difficult enough to parse that being able to mine it (and having some domain knowledge) will still yield interesting insights: http://www.fda.gov/NewsEvents/MeetingsConferencesWorkshops/P...

There's also OIRA, which I haven't ever looked at but seems to have the same potential of finding underreported links if you have the patience to parse and text mine it: https://www.whitehouse.gov/omb/oira_0910_meetings/

And of course, there's just the good ol' FEC contributions database, which at least shows you individuals (and who they work for): https://github.com/datahoarder/fec_individual_donors

This is not to undermine what's described in the OP...but just to show how lucky you are if you're in the U.S. when it comes to dealing with official records. They don't contain everything, perhaps, but there's definitely enough out there (never mind what you can obtain through FOIA by being the first person to ask for things) to explore influence and politics without as many technical hurdles.


Thanks; it's invaluable to hear from someone who has experience with the data.

Do you know what they are required to report? For example, if they have a 'social' dinner with a lobbyist, must that be reported? Are the requirements the same across the Executive Branch? All three branches?


I don't have much experience with the lobbying rules except for times that I've had to research things specifically. Usually disclosure requirements come with a minimum amount...In the House (not sure if the exact limits apply to the Senate...), the ethics rules are quite strict but not everything is recorded...for example, a legislator (or their staff) can only receive $100 of gifts from a single source in a calendar year..."gifts" being basically anything of value...but things under $10 don't count toward that limit. So getting Frappuccinos everyday with your favorite CEO probably wouldn't be recorded in any official capacity even though not only do those add up monetarily, but someone getting coffee with a legislator on a frequent basis would be a huge point of potential influence. However, legislators aren't allowed to get gifts (such as paid dinners) at all from a registered lobbyist [1].

Both the House and the Senate have gift travel databases (travel that's reimbursed by an outside group, such as a charter flight to visit an oil drilling rig) [2]

The branches differ in how such things are reported...this was pretty obvious recently when Justice Scalia died at a ranch and people started wondering who paid for the trip...take one look at how these forms are supplied and it should be pretty obvious why we don't normally hear about SCOTUS relationships until something really weird happens [3].

This NYT editorial "So Who's a Lobbyist?" has a nice rundown of the ways that people who would generally be considered a lobbyist can escape disclosure requirements: http://www.nytimes.com/2012/01/27/opinion/so-whos-a-lobbyist...

Still, it's useful to be able to parse the dataset in an attempt to find what's missing...something that is difficult to do conceptually unless you're dealing with the actual dataset on your own system.

[1] https://ethics.house.gov/gifts/house-gift-rule

[2] http://clerk.house.gov/public_disc/giftTravel.aspx

[3] http://pfds.opensecrets.org/N99999918_2008.pdf


I just ran across https://www.opensecrets.org/ and found it quite useful and comprehensive in tracking contributions to candidates.

I live in the US and am privileged with the level of transparency that exists, but it's still not necessarily enough. Similar issues are present with the clunky nature of government websites and databases and so I think we're in agreement that it's not even close to the potential of what it could be.

Thanks for sharing all the links and information!


> which does have some outright admissions

Did you mean omissions?


oops, yes :)


Damn. This is pretty impressive.


Peruvians, do you think this would cause a majority of meetings to be held outside of public office buildings or via secretive messaging systems?


This is really impressive, even more so by the fact that it has already led to discoveries being made.

Web scraping is a really powerful tool for increasing transparency on the internet especially with how transient online data is.

My own project, Transparent[1], has similar goals.

[1] https://www.transparentmetric.com/


This is a fascinating project. If successful, I suspect the result will be that lobbying no longer takes place in government offices ("shall we meet at that little place down the street?") or will be carried out over the phone.


Really interesting use of data extraction....

For developers and managers out there, do you prefer to build your own in-house scrapers or use Scrapy or tools like Mozenda instead? What about import.io and kimono?

I'm asking because a lot of developers seem adamant about not using web scraping tools they didn't develop themselves, which seems counterproductive, because you're taking on technical debt for an already-solved problem.

So developers, what is the perfect web scraping tool you envision?

And it's always a fine balance between people who want to scrape Linkedin to spam people, others looking to do good with the data they scrape, and website owners who get aggressive and threatening when they realize they are getting scraped.

It seems like web scraping is a really shitty business to be in and nobody really wants to pay for it.


I recently built a website that mines Argentinian Central Bank statistics daily and generates graphics and reports: http://estadisticasbcra.com/en

(The data I'm mining is published here: http://www.bcra.gov.ar/Estadisticas/estprv010000.asp )

In this case, some scripts using Beautiful Soup were enough to get the job done, but I was completely unaware of Scrapy. It seems like a fantastic tool; if I had known about it, I probably would have used it.
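For small jobs like this, even the standard library's `HTMLParser` can stand in for Beautiful Soup. A minimal sketch of the extraction step; the table markup below is invented for illustration, and the real BCRA page will differ:

```python
# Minimal table-cell scraper using only the standard library.
# The HTML fed to it here is a made-up example, not the real BCRA page.
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell and data.strip():
            self.cells.append(data.strip())

scraper = TableScraper()
scraper.feed("<table><tr><td>2016-03-09</td><td>14.95</td></tr></table>")
```

Beautiful Soup or Scrapy earn their keep when the markup is messier or when you need crawling, retries, and pipelines on top of parsing.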


Full disclosure, I work for Scrapinghub. Our tools are Scrapy and Portia, both open source and both free as in beer. Scrapy is for those who want fine-tuned manual control and who have a background in Python. Portia is the visual web scraper for those who are non-technical to technical but don't want to bother with code.

Web scraping is everywhere, even if it's not necessarily spoken about openly or acknowledged. The publicized perception of web scraping is fairly negative, but it doesn't take into account the benefits of data used in machine learning or democratized data extraction (as in the case of this article, or for building public-service apps like transportation notifications), or the simple realities of competitive pricing and monitoring the activities of resellers.

Researchers, academics, data scientists, marketers, the list goes on for those who use web scraping daily.

Glad you enjoyed the article! I'm hoping that more examples of ethical data extraction will start to turn the tide of public perception.


Every time I see a scrapinghub post I ask the same question: what's your strategy for dealing with CFAA suits that arise from use of your platform? Most web scraping is illegal in the United States.

I completely accept how important scraping is as a data source, but that doesn't make it any more legal. It's in a space right now where only big companies can take unmitigated advantage of the tool, because it'd cost millions of dollars to successfully defend a CFAA suit.


Hi! This is definitely not legal advice, so consult with a lawyer and do your own research if you are thinking of applying this to your own practices.

I work for Scrapinghub as well and try to understand the law around this. I can offer some pointers as to why I think some web scraping isn't illegal… there are of course some limits to this.

When the data scraped "is publicly available on the Internet, without requiring any login, password, or other individualized grant of access", the Eastern District of Virginia ruled in Cvent v. Eventbrite (https://casetext.com/case/cvent-inc-v-eventbrite) that one could not be deemed to have exceeded authorized access.

There are two ways, that I know of, that courts have ruled you can exceed your authorization:

- When the site owner has contacted you and removed your authorization in writing, as happened in Craigslist v. 3Taps.

- By accepting the terms of service and agreeing against scraping. You have to do this through a "clickwrap" ToS, rather than a "browsewrap". You can read about the differences here: https://termsfeed.com/blog/browsewrap-clickwrap/

As a matter of policy, we don't scrape any site with a ToS with clear anti-scraping language and which forces us to create an account or "constructively agree" as part of the use of the site.

Any user wishing to revoke authorization for anyone using our platform can make an abuse report on our site; we tend to handle these within 24 hours and haven't had a single claim go further than this stage, as we aim to be reasonable and look for a way to provide value to both sides.


Thanks for the response. It is good to know that you're cognizant of these issues, at least.

Most websites have anti-scraping boilerplate in their ToU. I'm pretty sure that's in the "customized" ToU I got from LegalZoom. So you're basically saying that if the data is not behind a login and doesn't require you to fill any forms that contain either a checkbox or nearby language that indicates submitting the form constitutes acceptance of the ToU, you'll scrape it, even if the ToU explicitly bans scraping.

What do you plan to do when you do receive a C&D from someone that claims you've agreed to their browsewrap ToU? Are you going to argue that your use is not unauthorized since the data is public?

I guess I assume you'll comply with any C&D since you state that receipt of a written removal constitutes a revocation of authorization. However, I don't believe that's enough for some. Check out QVC v. Resultly, where QVC sued on a CFAA claim based on browse-wrap agreements (they lost; Resultly asserted permission was granted by robots.txt and the Court agreed).
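Since the Resultly defense hinged on robots.txt granting permission, it's worth noting that checking it programmatically is trivial with the standard library. A sketch; the rules and URLs below are made-up examples:

```python
# Check whether a robots.txt file permits fetching given URLs.
# The rules and domain here are invented for illustration.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])
allowed = rp.can_fetch("MyBot", "https://example.com/listings")
blocked = rp.can_fetch("MyBot", "https://example.com/private/page")
```

Whether honoring robots.txt actually shields a scraper legally is, as this thread shows, very much case-by-case; it's a technical courtesy, not a safe harbor.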

Beyond just CFAA claims, there are copyright and trademark claims attached to most scraping cases. Those have unfortunately succeeded most of the time. The most egregious is Ticketmaster v RMG, where it was ruled that RMG had violated Ticketmaster's copyright by downloading the page (specifically, making a copy of the page contents in RAM and extracting only non-copyrightable content, and then discarding the complete copy; in short, downloading the page). Facebook v. Power Ventures is also pretty brutal.

I hope Scrapinghub is well-funded enough to trailblaze some space in the law here, because we really need it.


Every one of those cases involves multi-person organizations with significant revenues or funding. These cases largely reflect businesses forcefully shutting down other innovators by claiming some bullshit like the CFAA. The CFAA should really only apply to people doing SQL injection and other penetration. The vast majority of people scraping data don't fall under this category of malevolent behavior, although the dumbassery of people who want to scrape LinkedIn for 30 bucks on Freelancer ruins it for everyone.

I'm not sure Scrapinghub is funded externally; I couldn't find anything on their valuation or revenues, so I assume they're bootstrapped.

I don't discount any of what you wrote, but a lot of it is imagined dangers; Scrapinghub would be immune to such cases unless they sided with their customers like 3Taps did (3Taps did not stop scraping for their client PadMapper). I don't think Scrapinghub is willing to put their neck out for someone paying $20/month to scrape Craigslist. In fact, those are the shittiest segments of this market, IMO: the bottom feeders who bring excessively unrealistic expectations of scraping as a black-box magical world that will solve their problems.


I received a C&D from a Fortune 100 asserting that I was accessing their site in violation of the CFAA, among numerous other silly claims. I'm a 1 person company. I did have some revenues, but they were about even with my full-time job (this was a side project); certainly not enough to satisfy the retainers that lawyers wanted before they'd even think about taking me on. I eventually found a lawyer who agreed to help a little bit for a $2k retainer, but as you'd expect for that rate, I can't get much out of him.

I wasn't doing anything egregious. The product I offered did not compete with their products; it actually made it easier for the consumer to spend money with them. The data I was gathering is mere factual data and is not subject to copyright (though, as in Ticketmaster v. RMG, this alone will not protect from copyright infringement claims). Their site is the single place that this factual data can be found.

Their Terms of Use forbids access by either manual or automated processes; thus, it makes it illegal for anyone to use their site at all, and precludes any solution based on MTurk or similar. It also forbids any access for "commercial use". Combined, this means they can sue you and make you stop using their site basically whenever they want for any reason. They could've done this anyway because the CFAA protects them from any "unauthorized" access.

If I were to actually dispute this company's claims and refuse to comply with their C&D, they would sue me. This would've cost me millions of dollars in legal fees before the case was through, which is irrelevant to them but obviously well outside of my reach. There's a good chance they would've gotten an injunction legally forbidding me from continuing to offer my service almost immediately, so then I'd have been stopped from offering my product AND I would've had a pending lawsuit against me, which would've asserted some absurd dollar amount of damage, and, if Facebook v Power Ventures is any indication, there would've been a good chance that I would've been held personally liable for it.

It doesn't matter that their claims are all dependent on interpretation and grey area. What matters is that if you don't have $30-$40 million dollars sitting around, you can't take the risk of a lawsuit from a big company. Gotta earmark $1-10 million for legal fees (depending on what kind of lawyers you get; the opposing party in my case has one of the most expensive law firms in the country); set aside $5-10 million in case you lose and have to pay damages, set aside some chunk of money to continue to bear the cost of maintaining and running the business despite the legal pressure and despite the likelihood that you've been legally disallowed from selling your primary money maker pending resolution of the case, which will likely drag on for a minimum of 3-5 years, and up to 10 years is not really unheard of. Gotta have the extra $20 mil+ so that you don't pour more than 50% of your net worth into something that is very possibly a losing battle.

My lawyer advises me that the various workarounds I devised could be construed as conspiracy and aiding and abetting, even though I would no longer be making any requests to the complainant's servers at all. This also wouldn't stop the complainant from suing me for past damages or to stop the practice they dislike, even if I'm doing it through means that totally obviate the need to access any of their servers.

If I were to continue operating, the only option would be to leave the U.S. entirely for a jurisdiction that doesn't enforce U.S. judgments (since I would be sued in the US and lose by default; my lawyer indicates that merely moving my company overseas is insufficient), and not return until the statute of limitations expires on the judgment that would get registered. Even this is not foolproof because the activity would have to be obviously and unequivocally legal in the new host jurisdiction so that the company's lawsuit in that jurisdiction wouldn't get anywhere, the jurisdiction would have to decline to enforce judgments on U.S. persons, and they'd have to be impervious to attempts by one of the world's largest companies to influence their legal system. I haven't found such a jurisdiction yet. Some are kinda-sorta close (but not really).


It's not clear enough what you were doing before that led them to send a C&D. Were you doing what Scrapinghub does — running a web scraping tool vendor and service provider? It sounds like you were doing something shady enough for them to not even email you but C&D you directly.

There's a distinct line between what you do with the data you scrape vs. writing the tools and code to build a script that will get you that data.

You don't arrest the kitchen knife company's CEO because someone used a knife to stab someone. And the fact that web scraping/crawler vendors have saturated the market is testament to the fact that you're overreacting.

I can imagine the fear of facing a devastating legal battle, and how it might have permanently shifted your view on web scraping, but I see no valid basis for all web scraping services and vendors to shut down.

I also find it puzzling that you would act against your interests by continuing to talk openly and in detail about a legal situation like this, because that PadMapper guy pretty much shut up as soon as the details were involved.

But feel free to provide more details showing that you were running a web scraping service or software company.


We weren't doing anything remotely nefarious with the non-copyrightable data we gathered. Some details are intentionally unclear. You can continue to make your own inferences on these.

Big companies don't send polite emails asking you to pretty please stop. They just let their lawyers deal with the whole kit and caboodle.

It is illegal, or close enough to illegal, to scrape from practically any company in the U.S., because "unauthorized access" is a floating definition; as soon as that company makes a decision that they don't want you doing that thing you do anymore, you're doing something illegal; their change of heart can make your previously fine action a crime. The Terms of Use for almost all companies state as much. The statute does not state any required notice period or method, so you'd have to argue to the relevant magistrate that you didn't have reasonable knowledge that your scrape was unauthorized. This is the crux on which all scraping cases have hung, and the results are usually not favorable at all to the scrapers, although 1 or 2 recent decisions are sort of hopeful. Also note that this is only the CFAA portion; these suits usually allege a bunch of other torts too, which have proven similarly difficult to beat.

Scrapinghub's existence depends essentially on luck; first, that they won't get sued, and second, if they do get sued, that they'll get a sympathetic judge who will find that no contract was entered due to insufficient notice. That is not likely due to the nature of scrapinghub's operations (see Register.com v. Verio). The fact that some people are able to scrape and get away without being sued doesn't change the legal reality or the dubiousness of investing in a company with such a large risk profile.

The knife CEO analogy fails because CFAA claims are NOT about how the data is used. They are about the method used to obtain the data. The entity exceeding authorized computer access or accessing a computer without authorization -- in this case, that is scrapinghub, kimono, et al -- is the entity that has committed the violation of the CFAA. In your knife analogy, if the knife company had illegally acquired the metals used to manufacture the knife, it would be the culpable party, not the end user that bought its knives. The data that scrapinghub goes out, obtains, crafts and packages according to customer specifications ("make this page on craigslist a CSV file that auto-updates every 5 minutes") is the metal that the knife company goes out, obtains, crafts and packages according to customer specifications ("make this metal a sharp cutting utensil").

The person using the data that results from CFAA violations may be doing other illegal things, but in almost all of these types of cases, they're not violating the CFAA if they're not the ones accessing the computer that supplies the data.

I'm really not sure what you're arguing about anymore. The CFAA isn't a real law because the person gathering the data isn't necessarily the one putting it to use? I don't understand.


okay I think you are honestly trolling now. nicely played and good bye.


> Most web scraping is illegal in the United States.

Got a source on that? Google scrapes all the time; that's how they index all the pages they discover.

The only real scenario I recall is 3Taps v. Craigslist, but 3Taps just kept scraping Craigslist through multiple proxies even after Craigslist banned their IP addresses.

Then there are airplane ticket websites scraping each other and getting into hot water.

Having said that, it's not as clear-cut a definition as you'd like to put it. The CFAA ruling only happened because Craigslist felt directly threatened by PadMapper, which relied on 3Taps.


They didn't claim CFAA on us (PadMapper), and there was definitely no ruling on it (all parties settled). Just for the record.


Are you stating that there was no CFAA claim, or that PadMapper wasn't the involved party, because it was actually 3Taps? The case against 3Taps definitely included a CFAA claim and the judge refused to dismiss it.


You're right that there was a CFAA claim, but there wasn't one made against us.


Wow, so the guys doing all the heavy lifting (3Taps) took all the heat in the end. It looks like 3Taps is out of business but PadMapper is still up and running... getting data from crowdsourcing? It's really odd that if you make this efficient by automating it, then it's hacking.

This really is a shitty, shitty business model. All that work 3Taps did for you guys and they take all the heat? I don't know why 3Taps didn't just comply; was PadMapper 100% of their business?


Please don't edit your posts to substantially modify their meaning after someone has replied to you. You make ericd's response look weird now. Reply to the post again if you want to make a different point.


I couldn't reply because I was submitting too fast, so instead of replying I added to my original point, which was that 3Taps took the heat for PadMapper. The fact that PadMapper didn't get slapped with a CFAA claim means 3Taps took the major heat; and since you describe the CFAA as the biggest blunt instrument, I don't see why it makes his response look weird. He even wrote that PadMapper was not the subject of a CFAA claim, 3Taps was. It makes sense that he can't talk in detail about the case for legal reasons.


I hate that HN does that to anyone. It should be reversed only for spam bots and obvious bad faith participants, not someone with an unpopular opinion trying to have a conversation. I've encountered it before too. Sorry that it happened to you. You may want to lodge a complaint with dang so that he understands it's not a good mechanism.

I definitely think that on the outset, it looks weird that 3Taps ended up taking PadMapper's heat, but I think that 3Taps wanted to become a generalized thing-as-a-service vendor. It's possible that PadMapper wasn't 3Taps's only customer for the CL feeds. As PadMapper wasn't contacting CL's computers without authorization, it makes sense that CL had to change the target to 3Taps. At that point, PadMapper would've seen that scraping CL meant a near-impossible legal challenge for a startup and been wise enough not to implement their own solution.

This is all just speculation, but I doubt that 3Taps stuck its neck out for the sole benefit of PadMapper.


I think there is a delay before the "reply" button appears, for posts past a certain nesting level.

I like this feature because it impedes the rapid nesting of conversations, and also allows the author time to edit his reply before anyone can address it.


No worries, just trying to add some nuance. Probably can't share much there.


>Got a source on that?

The source is the CFAA, which makes it a crime and/or a tort to commit any "unauthorized" access to a computer system. Because authorization is not defined in the statute, it's a matter of interpretation whether or not one's use is unauthorized. Historically, judges have strongly disfavored scrapers.

Most boilerplate Terms of Use contain language that forbids all "spiders, scrapers, bots, and all other automated means of access", or something along those lines. Most companies assert that accessing any page beyond the front page of their site constitutes a binding agreement to their ToU, and thus that any automated access is "unauthorized". Scrapinghub appears to be of the opinion that browsewrap agreements are unenforceable, and while some judges have agreed with that, some haven't.

Beyond the argument that scraping is a breach of contract (violating their Terms of Use) and that since you agreed to that contract, you understood that automated access was unauthorized, there's the potential criminal element, which was deployed against weev for exposing a minor data leak in AT&T's system and against Aaron Swartz for exceeding MIT's authorized access to JSTOR and downloading publicly-funded academic data (including data which was out of copyright). You basically just have to really hope that no one inside the company you've "wronged" is good friends with a prosecutor.

Because there is a lot of grey area around what may or may not constitute "unauthorized" access to a computer system, if a company does bring a tort claim against you for accessing their system without authorization, you might actually win -- if you can afford the time and money to fight them for the minimum 3-5 years it'll take your case to resolve. This is hundreds of thousands in legal fees easy.

3taps eventually had no choice but to give up because they couldn't take the legal costs anymore. Power Ventures tried to stick it out and ended up not only being held liable for $3 million in damages to Facebook's systems when no actual damage had occurred at all, but also having the corporate veil pierced and the founder held personally liable. It's obvious from the court documents that he was struggling to afford counsel, and companies must be represented by an attorney, so he didn't even have the option of representing himself.

>Google scrapes all the time. This is how they index all the pages it discovers.

Yes, Google's operations are, strictly speaking, illegal on various fronts. They depend heavily on automated access, which many of the sites it indexes explicitly forbid, meaning Google is committing "unauthorized access" to those computer systems. It also stores complete copies of sites and the individual images displayed on them, virtually all of which are protected by copyright, all of which constitutes flagrant violation of copyright law.

If someone did bring a CFAA claim against Google for this (which no one would, because Google is one of the wealthiest companies in the world, and it'd therefore cost tens of millions to sue them), Google would likely argue that robots.txt is the only authorization it is obligated to observe, which may or may not be an effective argument. Google also makes no guarantees about the extent to which it obeys robots.txt; it's a way to signal your desires to Google, which it may or may not honor.
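For what it's worth, the robots.txt convention is mechanically simple, and Python ships a parser for it in the standard library. Here's a small sketch using a made-up robots.txt (the file contents and URLs are invented for illustration; a real crawler would fetch the site's actual /robots.txt first):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real crawler would download
# https://example.com/robots.txt before crawling the site.
robots_txt = """
User-agent: *
Disallow: /private/

User-agent: Googlebot
Disallow:
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

# Under these made-up rules, a generic bot is barred from /private/
# but may fetch public pages, while Googlebot is allowed everywhere.
print(rp.can_fetch("*", "https://example.com/private/data.html"))         # False
print(rp.can_fetch("*", "https://example.com/index.html"))                # True
print(rp.can_fetch("Googlebot", "https://example.com/private/data.html")) # True
```

Of course, honoring robots.txt is a courtesy convention, not a legal safe harbor, which is exactly the ambiguity being argued about above.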

tl;dr The very short answer to all of this is that traditionally, the legal system has been extremely suspicious of scrapers and has treated them very badly, applying concepts intended for the physical world like trespass to chattels to server access. This has been improving somewhat in recent years, but is still a very financially and legally precarious situation in which to find oneself. The people who get away with it get away with it because no one sued them before they were too big to sue.


The tricky thing is that if a scraping tool or service provider complies with website owners' demands to stop scraping, there is very little damage to claim. Even if a customer used Scrapinghub to log in to websites and scrape all the emails, all Scrapinghub would need to do is hand over their customer on a silver platter. This is what the DMCA is for. Can you imagine if you manufactured a bicycle and somebody used it to commit a crime? Plausible deniability. Scrapinghub can't monitor everyone's usage all the time to make sure they are following each website's ToS (which are not legally binding).


The DMCA protects service providers from copyright claims for user-generated content as long as they comply with takedown requests, etc. Scrapinghub may have a defense to copyright claims there (though I seriously doubt it due to the nature of their relationship with the customer; they're not a DMCA "safe harbor" and the data they're using isn't user-generated content), but not to CFAA claims.

It's illegal to break the CFAA whether the plaintiff specifically tells you that they think you're doing it or not. If they send a C&D, yes, you'd be wise to comply, but that's not going to absolve you from claims that you harmed their company by violating the CFAA before they sent it (which do happen and are usually claiming a pretty ridiculously silly amount of damages for something as innocent as downloading a web page from their server). You'd have to argue in court that your access was authorized and they'd have to argue that your access wasn't authorized. The judge and/or jury would then evaluate.

3Taps was actually quite similar to Scrapinghub. I don't think they have as much of a defense as you'd like. And Terms of Use are actually usually considered legally binding; to the extent that they're not, it's usually because of something minor like not putting the notice that you agree to the ToU by using the site in plain view.


I think you are overestimating the reach of the CFAA. There are multiple web scraping tools/services as vendors, not just ScrapingHub. All of them have been operating longer than 3taps, and some still scrape Craigslist and get away with it without issue, for the same reason you could hire a guy on Freelancer to scrape Craigslist for you. 3taps went above and beyond for their best client, PadMapper, and got burned.


>I think you are overestimating the reach of CFAA.

I don't think so. The CFAA states:

>Whoever intentionally accesses a computer without authorization or exceeds authorized access, and thereby obtains information from any protected computer shall be punished as provided in subsection (c) of this section. (a)(2)(C)

It defines a "protected computer" as:

>...the term "protected computer" means a computer which is used in or affecting interstate or foreign commerce or communication, including a computer located outside the United States that is used in a manner that affects interstate or foreign commerce or communication of the United States; (e)(2)(B)

As the Supreme Court has ruled that virtually anything in the United States is subject to the Commerce Clause, this comprises practically all computers, especially after you consider that usage of a computer network almost certainly takes your traffic out of state. Many states have corollary laws to the CFAA with substantially similar language, so if you can miraculously convince a judge that the computers involved are not part of interstate commerce and that the feds therefore have no jurisdiction, there's a good chance you'll have to contend against a similarly-worded state statute.

I don't see any limitations or exceptions here. If you are accessing a computer in an "unauthorized" manner and obtain information whilst doing so, you have violated the CFAA.

The reason scraping can happen is a combination of lack of technical awareness (both from lawyers about computers and from programmers about law) and the cost of pursuing a lawsuit. Even if you break the law, someone has to take issue with your law-breaking before anything happens; they have to file either a lawsuit or an indictment to get the ball rolling. That some people are able to get away with violating the CFAA without someone registering a formal complaint on the matter has nothing to do with whether or not one has violated the statute.

The only way that scrapers don't violate the CFAA is under a liberal interpretation of the term "unauthorized", wherein a judge rules that if a computer is advertising and allowing public access, then all members of the public are inherently authorized to access it. I know that several scrapers have taken their cases through the courts hoping that such an interpretation would be given.


I agree, this is definitely a solved problem!

If you need to build a solid web scraping stack which is going to be maintained by many people and is critical to your business, you have two options: use Scrapy, or build something yourself.

Scrapy has been tried and tested over 6-7 years of community development, as well as being the base infrastructure for a number of >$1B businesses. Not only this, but there is a suite of tools which have been built around it – Portia for one, but also lots of other useful open source libraries: http://scrapinghub.com/opensource/

Right now most people still have to use XPath or CSS selectors to run their crawls and extract the data, but not for much longer.

There are more and more ways of skipping this step and getting at data automatically: https://github.com/redapple/parslepy/wiki/Use-parslepy-with-... https://speakerdeck.com/amontalenti/web-crawling-and-metadat... https://github.com/scrapy/loginform https://github.com/TeamHG-Memex/Formasaurus https://github.com/scrapy/scrapely https://github.com/scrapinghub/webpager https://moz.com/devblog/benchmarking-python-content-extracti...

Scrapy (along with lots of other Python tools, likely a majority of them created by people using it and BeautifulSoup) has lowered the cost of building web data harvesting systems to the point where one person can build crawlers for an entire industry in a couple of months.


It doesn't scale very well unless you have a lot of patience, but I've had immense success using the IMPORTXML() function in Google Sheets to compile raw election data while doing some freelance work for the Texas Libertarian party a couple of years ago.

Outside of that, I often found myself building my own tools with a combination of Ruby, Nokogiri and Mechanize. Partly out of a desire to learn something new, and partly because many of my use cases didn't require anything more complex than "go to these pages, get the data within these elements and throw a CSV file over there".
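That "fetch, extract, dump to CSV" workflow really is tiny. Here's a sketch of the extraction half in the Python standard library (no Scrapy or BeautifulSoup needed), using a made-up HTML snippet standing in for a downloaded results page:

```python
import csv
import io
from html.parser import HTMLParser

class TableGrabber(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_td = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td":
            self._in_td = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag == "td":
            self._in_td = False

    def handle_data(self, data):
        if self._in_td:
            self._row.append(data.strip())

# Stand-in for a fetched page; in real use you'd download it first.
html = """
<table>
  <tr><td>Travis</td><td>1234</td></tr>
  <tr><td>Harris</td><td>5678</td></tr>
</table>
"""

grabber = TableGrabber()
grabber.feed(html)

buf = io.StringIO()
csv.writer(buf).writerows(grabber.rows)
print(buf.getvalue())
```

For anything messier than well-behaved tables, a real parser library (Nokogiri, lxml, BeautifulSoup) is worth the dependency, but for the simple cases this is the whole job.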


After Kimono got shut down, I think a self-hosted open source version would be extremely popular. I want to build my own solution, but the API functionality and pagination / AJAX loaded data would be too difficult.
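Pagination, at least, is usually less scary than it looks: it's a loop that follows "next" links until there isn't one. Here's a minimal sketch with the fetching abstracted behind a callable, so a fake three-page "site" (invented here for illustration) can stand in for real HTTP plus parsing:

```python
def crawl_pages(fetch, start_url, max_pages=100):
    """Follow 'next' links until a page has none (or we hit max_pages).

    `fetch` is any callable returning (items, next_url); in real use it
    would download and parse a page, here it can be a stub.
    """
    items, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)                 # guard against pagination loops
        page_items, url = fetch(url)
        items.extend(page_items)
    return items

# A fake three-page site standing in for real HTTP + parsing.
site = {
    "/page/1": (["a", "b"], "/page/2"),
    "/page/2": (["c"], "/page/3"),
    "/page/3": (["d"], None),  # last page: no next link
}

print(crawl_pages(site.get, "/page/1"))  # ['a', 'b', 'c', 'd']
```

AJAX-loaded data is often the same pattern one level down: the page's JavaScript is hitting a JSON endpoint you can find in the browser's network tab, and you page through that endpoint directly instead of rendering the page.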


Not sure if you're interested, but we (Scrapinghub) do offer a Kimono to Portia migration https://blog.scrapinghub.com/2016/02/25/migrate-your-kimono-...

Otherwise, I'd recommend you check out Portia (open source). We're in the middle of releasing the beta 2.0 version.


Interesting. How would a self-hosted open source version make money, though, in order to support itself and continue to improve?

Is this even a realistic business model? Seems like this is what Scrapy is doing and what Import.io is doing. Make the tool free in order to get free marketing and then charge people willing to pay money to extract data.

Meanwhile I see Mozenda charging something like 5 cents for each page extracted. Do you think this is a fair model, or does it not matter?


So for Scrapy and Portia, they are both free as in beer, specifically because we believe in the power of open source. Scrapy actually precedes Scrapinghub and was certainly not developed as a marketing tool.

Charges come with large scale crawls (above certain limits on our platform), additional products like Crawlera (our smart downloader that routes requests from a crawl through a pool of IP addresses to avoid bans), datasets, and for us to handle complex crawls for companies outsourcing to us.

Our model is that there is something for everyone whether you are looking to dip your toes into web scraping (free), use it occasionally (usually journalists) or dependent on web crawling for your business.


>Scrapy actually precedes Scrapinghub

Right. I first came across Scrapy some years ago while browsing the web for Python software tools, on the site of a Uruguayan company called Insophia. It was in the list of products they had developed and worked on. Scrapinghub came later.


By offering paid hosting and support for companies that don't want the burden of managing it themselves? There could also be some additional features in the paid version.


I'm wondering about this too, but how realistic is it to expect someone to pay when it's already free? I would imagine only large enterprise users, so essentially you are supporting free users by charging enterprise users who may pay you for support.

Horrible industry imho when you have to give away things for free just to be competitive. I just don't get why people would expect software to be free.


We don't expect anyone to pay for Scrapy or Portia!

We provide the best Platform (as a Service) to run Scrapy or Portia spiders, and will soon be supporting most standard web scraping technologies. This is free for light users, but we charge for people who need extra or dedicated computing or network resources.

We also provide help to startups or enterprise orgs looking to get help in building a web data harvesting system (more than just parsing pages!), either by building it ourselves or by helping our partners train their engineers in using our technologies.

This has worked so far, and we're very healthy from a revenue perspective – more than doubling every year for a few years now, and good enough to grow to become the largest fully distributed company outside of the US.

We're pretty happy with being a brand that gives to the community, it tends to get repaid 10x in the long run.


How much are you generating in terms of revenue? Are you venture funded?


Would need to Kickstart or Patreon it.


> an already solved problem

It's a hard problem to generalize.

> balance between people who want to scrape Linkedin to spam people, others looking to do good with the data they scrape, and website owners who get aggressive and threatening when they realize they are getting scraped

Agreed. No one wants to be the bad guy and most clients looking to spam people are awful clients to have anyhow. Btw scraping LinkedIn is fairly difficult/expensive and they like to sue people.


[flagged]


We've banned this account for repeatedly violating the HN guidelines.

We're happy to unban accounts when people give us reason to believe they will post only civil and substantive comments in the future. You're welcome to email hn@ycombinator.com if that's the case.



