
Code to transform Hillary's emails from raw PDF documents to a SQLite database - shagunsodhani
https://github.com/benhamner/hillary-clinton-emails
======
danso
FWIW, the Wall Street Journal's data team has also posted its repo of code for
processing the Clinton emails:

[https://github.com/wsjdata/clinton-email-cruncher](https://github.com/wsjdata/clinton-email-cruncher)

Kind of fun to compare methods...converting PDFs can be such drudge work
sometimes. I'm interested in the data-cleaning/name-resolution aspect...though
that is also another deadly boring programming task.

------
sambrand
With zero assurance that all work emails are truly in the dump, I wonder if it
will be at all possible to analyze gaps in it (i.e. questions by Hillary
without a response, regularly recurring interactions missed) to make educated
guesses about what confidential emails she purged.
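
A rough sketch of the "questions without a response" idea, in Python. Everything here is an assumption for illustration: the `Email` record layout is hypothetical (the real database schema may differ), the list is assumed to be in chronological order, and matching threads by normalized subject line is only a crude heuristic.

```python
from collections import namedtuple

# Hypothetical record layout; the real database schema may differ.
Email = namedtuple("Email", "sender recipient subject body")


def normalize(subject):
    """Strip leading reply/forward prefixes so a thread shares one key."""
    s = subject.strip().lower()
    while s.startswith(("re:", "fw:", "fwd:")):
        s = s.split(":", 1)[1].strip()
    return s


def unanswered_questions(emails, person):
    """Return emails sent by `person` that contain a '?' and have no later
    message addressed to `person` with the same normalized subject.

    Assumes `emails` is sorted chronologically.
    """
    result = []
    for i, e in enumerate(emails):
        if e.sender != person or "?" not in e.body:
            continue
        key = normalize(e.subject)
        replied = any(
            later.recipient == person and normalize(later.subject) == key
            for later in emails[i + 1:]
        )
        if not replied:
            result.append(e)
    return result


emails = [
    Email("H", "aide", "Schedule", "Can you send the schedule?"),
    Email("aide", "H", "Re: Schedule", "Attached."),
    Email("H", "aide", "Call sheet", "Where is the call sheet?"),
]
gaps = unanswered_questions(emails, "H")
```

A hit only means no reply shows up in the dump; the answer could just as well have come by phone or in person, so this would flag candidates to read by hand, not prove anything.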

~~~
triggercut
IIRC "Exploring Everyday Things with R and Ruby" By Sau Sheong Chang does
something at least similar with the Enron email database.

------
rebootthesystem
What a sad moment in US history. "What do you mean? Wipe it with a cloth?"
Translation, I have nothing but contempt for the American people. Any one of
us would have been in serious legal trouble for much less.

Nice code, BTW. How is pdftotext when compared to PyPDF2?

~~~
generic_user
It is clearly an age of open contempt and disregard for the citizenry by a
political class which sees itself as having a hereditary right to rule.

------
tlack
Are there any open source tools that would slurp in content like this and
develop its own sense of relationships in the data, that I could then explore
by hand?

Bedarra's Text Analyzer[1] kinda floored me and I'd like to use something
similar for various tasks, if there was something good and free.

[1]
[http://www.bedarra.com/movies/textAnalyserMovie.html](http://www.bedarra.com/movies/textAnalyserMovie.html)

~~~
AdieuToLogic
> Are there any open source tools that would slurp in content like this ...

Yes, tesseract[1] can do a pretty good job. Here[2] is a blog post which
describes using it to perform OCR on PDFs.

As for searching the PDF contents, Solr[3] might be what you are looking for
instead.

1 - [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)

2 - [http://fransdejonge.com/2012/04/ocr-text-in-pdf-with-tessera...](http://fransdejonge.com/2012/04/ocr-text-in-pdf-with-tesseract/)

3 - [http://stackoverflow.com/questions/6694327/indexing-pdf-with...](http://stackoverflow.com/questions/6694327/indexing-pdf-with-solr)

------
honksillet
Why are the emails being released as PDFs?

~~~
danso
This is standard procedure...the emails go through a vetting process to redact
things that are not meant to be released to the public. This can mean anything
from classified information to private information that was sent to an
official (medical records, personal information in a forwarded application,
etc). The PDF workflow is probably the most efficient way to do this...I don't
mean efficient in terms of what programmers think of as efficient...I mean
efficient in that the vetting/censoring process involves a good number of
officials who are reviewing the documents, which probably involves some
back-and-forth, for which paper is a pretty decent medium. Also, not all/many
of these officials are versed in digital editing workflows [1]. I think in
some situations, the emails are printed, _physically_ redacted (i.e. black
marker), then scanned and converted to PDF. Keep in mind that many requesters
of government documents are perfectly fine with printable documents, even if
such documents are not suited for machine parsing.

[1] Government officials have been burned before when using what they
_thought_ was a PDF redaction tool...i.e. using the box-drawing-tool and
drawing a black box...was not actually redaction...Governor Blagojevich's case
comes to mind: [http://www.wbez.org/blog/update-blagojevich-lawyers-created-...](http://www.wbez.org/blog/update-blagojevich-lawyers-created-pdf-document-faulty-redacted-statements)

~~~
brc
How is it not easier to programmatically find specific names/emails and
replace them with (redacted)?
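
Something like this minimal Python sketch is what I have in mind. The `redact` helper, the address regex, and the name list are all made up for illustration; it assumes the names to remove are known up front.

```python
import re

# Deliberately simple address pattern; real-world addresses vary more.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text, names):
    """Replace email addresses and each listed name with '(redacted)'."""
    text = EMAIL_RE.sub("(redacted)", text)
    for name in names:
        # Whole-word, case-insensitive match on the literal name.
        pattern = r"\b" + re.escape(name) + r"\b"
        text = re.sub(pattern, "(redacted)", text, flags=re.IGNORECASE)
    return text


clean = redact("Contact Jane Doe at jane.doe@example.com today.", ["Jane Doe"])
missed = redact("Ask J. Doe about it.", ["Jane Doe"])
```

Here `clean` comes back as "Contact (redacted) at (redacted) today.", while `missed` is left unchanged because "J. Doe" is not on the list, so this only catches exact spellings you already know about.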

In this day and age it takes a lot more work to print, manually redact and
scan than it does to digitally redact permanently. Your story about redacting
PDFs is interesting, but these were not originally PDFs; they were emails.

I can't see any other reason for delivering documents as printed/scanned PDF
than to make it more difficult to analyse, considering they were originally
not paper or PDF. Occam's razor applies here.

~~~
danso
You're preaching to the wrong person...I think the world would be a better
place if everyone could grep/sed/awk their way through textfiles. That said,
imagine the agency's situation: They have thousands of documents to review
before releasing. Many/most of these emails are releasable as is. But some
potentially contain classified information that cannot be known a
priori...i.e. it's not as simple as doing a grep for "CLASSIFIED" or "OSAMA".
In fact, one of the current controversies with the Clinton email is that
they're finding information that _should_ have been treated as classified, but
at the time of sending, was not...the conditions for classification aren't
often based off of keywords, but on the context and who the sender is.

Occam's Razor falls on the side of the print-to-paper-then-scan-to-PDFs. When
you print an email or Word document to paper, you have effectively destroyed
all non-visible metadata. When you use a Sharpie to rub out a name, you can be
sure that mark is going to propagate to the electronic scan. You have none of
those assurances if you work electronically...and imagine being someone who
_isn't_ a programmer. Not too long ago the NSA was clusterfucked by a low-
level employee who ran wget without anyone noticing. If you're the non-
programming bureaucrat in charge of this bespoke redaction process, software
seems very arcane and insecure...paper sounds pretty nice in comparison.

But yeah, I do wonder about the amount of human labor and late nights that go
into this. But apparently, a lot of our legal system has revolved around
legions of paralegals and lawyers sifting through uncountable volumes of
paper...and a lot of the people doing the current redaction probably came from
that field of work.

~~~
brc
I meant Occam's razor as a way of explaining why they were handed over as
paper - the simplest explanation being they were intentionally making it
harder.

I get why the agency might print it out, but I don't get why the Clinton
staffers would print them out. If the government is going to subpoena or raid
me, they're just taking my computer. Sure, if it gets to public release, then
they might print/redact. But they sure are not going to wait for me to print
out stuff and hand it over.

Given the original 'two phones are too difficult' excuse for having the
private emails, I can't see how printing out the paper and handing that over
can be seen as anything but deliberate delaying. Unless I'm missing something
here, it's not like the Clinton staff redacted the mails themselves - they
just deleted the ones they didn't want to hand over.

If the agency staff are redacting for public release, then that is a different
thing altogether, they could have done that from the original emails.

------
azinman2
Awesome. I didn't know the raw data was available anywhere publicly. I thought
only journalists/gov had it

~~~
generic_user
Not all of it is. Only small parts of it.

------
hoju
Has anyone created a web interface yet for browsing these emails?

------
tiglionabbit
I don't get it. She wasn't supposed to use a private email server because it
could be compromised and leak classified information. And now they're
releasing the emails to the public. How does this make any sense?

~~~
nhebb
I haven't followed all the details of this saga, but my understanding is that
this release is the result of a Freedom of Information Act (FOIA) request for
the emails. The emails were vetted by the State Department, and only those
that do not have security clearance restrictions are being released to the
public.

------
burritofanatic
Python 3!

------
mahmud
OT:

"Why do we always refer to female public figures by their first name, and male
ones by their last?"

I heard/read that question somewhere, and it stuck with me, can't stop seeing
it. And this is a fine example of that.

Google results:

"Clinton email scandal" 41M results.

"Hillary email scandal" 48M results.

Yes, I _do_ realize hers is a special case, and it's necessary to use her
first name to disambiguate wrt Bill. But the phenomenon exists. Just
something to keep in mind.

[Edit:

" _In the tennis commentary, women athletes were called by only their first
names 52.7% of the time, while men were referred to by only their first names
7.8% of the time._ "

[http://www.la84.org/gender-stereotyping-in-televised-sports/](http://www.la84.org/gender-stereotyping-in-televised-sports/)

]

~~~
Leszek
To be fair, there was another famous Clinton that could make this confusing.
I've never heard of Condoleezza Rice, Sarah Palin or Angela Merkel being
called by just their first name, although I admit that I don't often hear them
being called by just their last name either (maybe except Merkel).

~~~
qq66
Condoleezza Rice was often referred to as "Condi" in the media. I think a lot
of it has to do with information theory: there are two famous Clintons,
several famous Rices, and hundreds of famous Sarahs.

~~~
Leszek
I stand corrected on that one, I'd personally only ever heard of her referred
to by her full name, but that's just my own ignorance.

------
zac1944
Is stealing email becoming legal, no matter who it belongs to?

