
How Web Scraping Is Revealing Lobbying and Corruption in Peru - bezzi
https://blog.scrapinghub.com/2016/03/09/how-web-scraping-is-revealing-lobbying-and-corruption-in-peru/
======
kilotaras
I'm from Ukraine, and our biggest success in battling corruption comes from a
system called Prozorro [1] ("transparently") for government tenders.

It started as a volunteer project, and some projections put savings at around
10% of the total budget once it becomes mandatory in April.

[1] [https://github.com/openprocurement/](https://github.com/openprocurement/)

------
carlosp420
Hi there, I am the author of the blog post. I'll be happy to answer any
questions.

~~~
nsoldiac
Carlos, really great work, congratulations!! I've been studying topics at the
intersection of technology and corruption for a while here at Berkeley. I have
interesting testimonies from contacts who have lived through the
post-technology change in government. Peru has a lot of potential in this
area. If you ever need help, I'd be happy to support you!

~~~
carlosp420
Thank you very much! In Peru there are already several groups of journalists
who have partnered with programmers on interesting data-journalism projects:
Ojo Publico [1], Convoca [2], and IDL Reporteros [3]. But even so, we can't
keep up; there's so much to do!

[1] [http://ojo-publico.com/](http://ojo-publico.com/) [2]
[http://www.convoca.pe/](http://www.convoca.pe/) [3] [https://idl-
reporteros.pe/](https://idl-reporteros.pe/)


------
ecthiender
Very interesting how tools like these can be so helpful for journalists and,
more generally, for transparency in government functions.

Probably world-changing, considering that even semi-technical folks can cook
up tools to dig into things like this.

I know this tool was built by a developer, but Scrapinghub also has a web UI
for making scrapers.
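
To illustrate just how little code such a scraper needs, here is a minimal
sketch using only Python's standard library. The page content is an inline
stand-in for a real download (e.g. via `urllib.request`), and the URL in it is
made up for illustration:

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects every href attribute seen on <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for name, value in attrs if name == "href")

# Stand-in for a page fetched from the web
page = '<html><body><a href="/visit-log?page=2">next page</a></body></html>'

parser = LinkExtractor()
parser.feed(page)
print(parser.links)  # ['/visit-log?page=2']
```

Real-world scrapers (Scrapy included) add request scheduling, retries, and
structured item pipelines on top, but the core extraction step is this small.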

~~~
unsettledtck
Full disclosure: I work for Scrapinghub, and the web UI you speak of is Portia,
our open-source visual web scraper. It's for anyone, from non-technical to
technical, who wants a quick way to scrape data. I think it's extremely
important to develop tools that democratize the acquisition of data regardless
of technical background and skill. Glad you find the article and tool
interesting!

~~~
ecthiender
Yes, totally agree with you on the great potential of tools for easy data
acquisition.

I have personally used Scrapy in the past, I find it to be a great tool.

Congratulations on your work!

~~~
unsettledtck
Thank you. Glad you enjoy Scrapy, we're pretty fond of it ourselves!

------
xiphias
Can you draw a covisit graph of people, i.e. who visited the building at the
same times as somebody else? The strength of a connection could be:

visited_both^2 / ((visited_without_other_1 + 1) * (visited_without_other_2 + 1))
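
That formula can be computed directly from a visit log. A small sketch, with
made-up visitor names and dates for illustration:

```python
from itertools import combinations

# Hypothetical visit log: visitor -> set of dates they entered the building
visits = {
    "alice": {"2016-03-01", "2016-03-02", "2016-03-05"},
    "bob":   {"2016-03-01", "2016-03-05", "2016-03-07"},
    "carol": {"2016-03-07"},
}

def covisit_strength(a, b):
    """visited_both^2 / ((without_other_1 + 1) * (without_other_2 + 1))"""
    both = len(visits[a] & visits[b])
    only_a = len(visits[a] - visits[b])
    only_b = len(visits[b] - visits[a])
    return both ** 2 / ((only_a + 1) * (only_b + 1))

# Weighted edges of the covisit graph
edges = {
    frozenset(pair): covisit_strength(*pair)
    for pair in combinations(visits, 2)
}
print(edges[frozenset(("alice", "bob"))])  # 1.0
```

The `+ 1` terms keep the denominator finite when two people only ever visit
together, and pairs who never overlap get a strength of zero.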

------
alecco
In other countries, corrupt politicians have found that a simple captcha per n
items is enough to defeat analysis.

~~~
smarx007
[https://anti-captcha.com/](https://anti-captcha.com/) &
[https://rucaptcha.com/](https://rucaptcha.com/) - I think that can best be
summarised as "from Russia with love" :)

------
danso
FWIW, if you live in the U.S., you benefit from having such data in great
quantity, though I don't think it's sliced and diced to anywhere near its
potential:

Lobbyists have to follow registration procedures, and their official
interactions and contributions are posted to an official database that can be
downloaded as bulk XML:

[http://www.senate.gov/legislative/lobbyingdisc.htm#lobbyingd...](http://www.senate.gov/legislative/lobbyingdisc.htm#lobbyingdisc=lda)
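
Working with such a bulk XML dump mostly means iterating over records and
counting or joining on attributes. A sketch with Python's standard library;
note the element and attribute names below are invented for illustration, as
the real LDA dump has its own schema:

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Inline stand-in for one chunk of the bulk download (hypothetical schema)
sample = """
<Filings>
  <Filing ID="1"><Registrant Name="Acme Lobbying"/><Client Name="BigCo"/></Filing>
  <Filing ID="2"><Registrant Name="Acme Lobbying"/><Client Name="OtherCo"/></Filing>
  <Filing ID="3"><Registrant Name="Capitol Partners"/><Client Name="BigCo"/></Filing>
</Filings>
"""

root = ET.fromstring(sample)
# How many filings each lobbying firm submitted
filings_per_registrant = Counter(
    filing.find("Registrant").attrib["Name"] for filing in root.iter("Filing")
)
print(filings_per_registrant.most_common())
```

For the real multi-gigabyte files, `ET.iterparse` streams records without
loading everything into memory.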

Could they lie? Sure, but in the basic analysis that I've done, they generally
don't feel the need to...or rather, things that I would have thought that
lobbyists/causes would hide, they don't. Perhaps the consequences of getting
caught (e.g. in an investigation that discovers a coverup) far outweigh the
annoyance of filing the proper paperwork...having it recorded in a XML
database that few people take the time to parse is probably enough obscurity
for most situations.

There's also the White House visitor database, which _does_ have some outright
omissions, but still contains valuable information if you know how to filter
the columns:

[https://www.whitehouse.gov/briefing-
room/disclosures/visitor...](https://www.whitehouse.gov/briefing-
room/disclosures/visitor-records)
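
Filtering the columns can be as simple as the sketch below. The header names
here are assumptions for illustration, not the real columns of the
visitor-records CSV, and the rows are invented:

```python
import csv
import io

# Inline stand-in for the downloaded visitor-records CSV (hypothetical columns)
raw = """namelast,namefirst,visitee_namelast
DOE,JOHN,EMANUEL
ROE,JANE,OBAMA
POE,EDGAR,EMANUEL
"""

rows = list(csv.DictReader(io.StringIO(raw)))
# Everyone logged as visiting a particular official
emanuel_visits = [row for row in rows if row["visitee_namelast"] == "EMANUEL"]
print(len(emanuel_visits))  # 2
```

The interesting analysis is exactly the kind of absence noted below: an
official with a key role but few logged visits is itself a signal worth
chasing.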

But it's also a case (as it is with most data) where having some political
knowledge is almost as important as being good at data-wrangling. For example,
it's trivial to discover that Rahm Emanuel had few visitors despite his key
role, so you'd have to be able to notice that and then take the extra step to
find out his workaround:

[http://www.nytimes.com/2010/06/25/us/politics/25caribou.html](http://www.nytimes.com/2010/06/25/us/politics/25caribou.html)

And then there are the many bespoke systems and logs you can find if you do a
little research. The FDA, for example, has a calendar of FDA officials'
contacts with outside people...again, it might not contain everything but it's
difficult enough to parse that being able to mine it (and having some domain
knowledge) will still yield interesting insights:
[http://www.fda.gov/NewsEvents/MeetingsConferencesWorkshops/P...](http://www.fda.gov/NewsEvents/MeetingsConferencesWorkshops/PastMeetingsWithFDAOfficials/default.htm)

There's also OIRA, which I haven't ever looked at but seems to have the same
potential of finding underreported links if you have the patience to parse and
text mine it:
[https://www.whitehouse.gov/omb/oira_0910_meetings/](https://www.whitehouse.gov/omb/oira_0910_meetings/)

And of course, there's just the good ol' FEC contributions database, which at
least shows you individuals (and who they work for):
[https://github.com/datahoarder/fec_individual_donors](https://github.com/datahoarder/fec_individual_donors)

This is not to take away from what's described in the OP...but just to show
how lucky you are if you're in the U.S. when it comes to dealing with official
records. They don't contain everything, perhaps, but there's definitely enough
out there (never mind what you can obtain through FOIA by being the first
person to ask for things) to explore influence and politics without as many
technical hurdles.

~~~
hackuser
Thanks; it's invaluable to hear from someone who has experience with the data.

Do you know what they are required to report? For example, if they have a
'social' dinner with a lobbyist, must that be reported? Are the requirements
the same across the Executive Branch? All three branches?

~~~
danso
I don't have much experience with the lobbying rules except for times that
I've had to research things specifically. Usually disclosure requirements come
with a minimum amount...In the House (not sure if the exact limits apply to
the Senate...), the ethics rules are quite strict but not everything is
recorded...for example, a legislator (or their staff) can only receive $100 of
gifts from a single source in a calendar year..."gifts" being basically
anything of value...but things under $10 don't count toward that limit. So
getting Frappuccinos every day with your favorite CEO probably wouldn't be
recorded in any official capacity even though not only do those add up
monetarily, but someone getting coffee with a legislator on a frequent basis
would be a huge point of potential influence. However, legislators aren't
allowed to get gifts (such as paid dinners) at all from a registered lobbyist
[1].

Both the House and the Senate have gift travel databases (travel that's
reimbursed by an outside group, such as a charter flight to visit an oil
drilling rig) [2]

The branches differ in how such things are reported...this was pretty obvious
recently when Justice Scalia died at a ranch and people started wondering who
paid for the trip...take one look at how these forms are supplied and it
should be pretty obvious why we don't normally hear about SCOTUS relationships
until something really weird happens [3].

This NYT editorial "So Who's a Lobbyist?" has a nice rundown of the ways that
people who would generally be considered a lobbyist can escape disclosure
requirements: [http://www.nytimes.com/2012/01/27/opinion/so-whos-a-
lobbyist...](http://www.nytimes.com/2012/01/27/opinion/so-whos-a-
lobbyist.html)

Still, it's useful to be able to parse the dataset in an attempt to find
what's missing...something that is difficult to do conceptually unless you're
dealing with the actual dataset on your own system.

[1] [https://ethics.house.gov/gifts/house-gift-
rule](https://ethics.house.gov/gifts/house-gift-rule)

[2]
[http://clerk.house.gov/public_disc/giftTravel.aspx](http://clerk.house.gov/public_disc/giftTravel.aspx)

[3]
[http://pfds.opensecrets.org/N99999918_2008.pdf](http://pfds.opensecrets.org/N99999918_2008.pdf)

------
prawn
Peruvians, do you think this would push a majority of meetings to be held
outside public office buildings, or onto secretive messaging systems?

------
dkarp
This is really impressive, even more so because it has already led to
discoveries being made.

Web scraping is a really powerful tool for increasing transparency on the
internet, especially given how transient online data is.

My own project, Transparent[1], has similar goals.

[1] [https://www.transparentmetric.com/](https://www.transparentmetric.com/)

------
Angostura
This is a fascinating project. If successful, I suspect the result will be
that lobbying no longer takes place in government offices ("shall we meet at
that little place down the street?"), or will be carried out over the phone.

------
jorgecurio
Really interesting use of data extraction....

For developers and managers out there, do you prefer to build your own in-
house scrapers or use Scrapy or tools like Mozenda instead? What about
import.io and kimono?

I'm asking because a lot of developers seem to be adamantly against using web
scraping tools they didn't develop themselves, which seems counterproductive
because you're taking on technical debt for an already-solved problem.

So developers, what is the perfect web scraping tool you envision?

And it's always a fine balance between people who want to scrape Linkedin to
spam people, others looking to do good with the data they scrape, and website
owners who get aggressive and threatening when they realize they are getting
scraped.

It seems like web scraping is a really shitty business to be in and nobody
really wants to pay for it.

~~~
austinhutch
After Kimono got shut down, I think a self-hosted open source version would be
extremely popular. I want to build my own solution, but the API functionality
and pagination / AJAX loaded data would be too difficult.

~~~
jorgecurio
Interesting; how would a self-hosted open-source version make money, though,
in order to support itself and continue to improve?

Is this even a realistic business model? Seems like this is what Scrapy is
doing and what Import.io is doing. Make the tool free in order to get free
marketing and then charge people willing to pay money to extract data.

Meanwhile, I see Mozenda charging something like 5 cents per page extracted.
Do you think that's a fair model, or does it not matter?

~~~
unsettledtck
Scrapy and Portia are both free as in beer, specifically because we believe in
the power of open source. Scrapy actually precedes Scrapinghub and was
certainly not developed as a marketing tool.

Charges come with large scale crawls (above certain limits on our platform),
additional products like Crawlera (our smart downloader that routes requests
from a crawl through a pool of IP addresses to avoid bans), datasets, and for
us to handle complex crawls for companies outsourcing to us.

Our model is that there is something for everyone whether you are looking to
dip your toes into web scraping (free), use it occasionally (usually
journalists) or dependent on web crawling for your business.

~~~
vram22
>Scrapy actually precedes Scrapinghub

Right. I first came across Scrapy some years ago while browsing the web for
Python software tools, on the site of a company in Uruguay called Insophia. It
was listed among the products they had developed and worked on. Scrapinghub
came later.

