Code to transform Hillary's emails from raw PDF documents to a SQLite database (github.com/benhamner)
94 points by shagunsodhani on Sept 9, 2015 | 42 comments



FWIW, the Wall Street Journal's data team has also posted its repo of code for processing the Clinton emails:

https://github.com/wsjdata/clinton-email-cruncher

Kind of fun to compare methods...converting PDFs can be such drudge work sometimes. I'm interested in the data-cleaning/name-resolution aspect...though that is also another deadly boring programming task.


With zero assurance that all work emails are truly in the dump, I wonder if it will be at all possible to analyze gaps in it (i.e. questions by Hillary without a response, regularly recurring interactions missed) to make educated guesses about what confidential emails she purged.
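
As a rough sketch of how such a gap analysis might start: treat any message that contains a question and never gets a later reply from someone else in the same thread as a candidate gap. The `Email` record, its field names, and the sample messages are hypothetical, not taken from the released data or the linked repo.

```python
from collections import namedtuple

# Hypothetical record layout; the real database schema may differ.
Email = namedtuple("Email", "id thread sender sent body")

def unanswered_questions(emails, asker):
    """Return ids of questions from `asker` with no later reply by anyone else in the thread."""
    gaps = []
    for e in emails:
        if e.sender != asker or "?" not in e.body:
            continue
        replied = any(
            r.thread == e.thread and r.sender != asker and r.sent > e.sent
            for r in emails
        )
        if not replied:
            gaps.append(e.id)
    return gaps

# Made-up sample data, for illustration only.
sample = [
    Email(1, "t1", "H", 1, "Can you send the schedule?"),
    Email(2, "t1", "aide", 2, "Attached."),
    Email(3, "t2", "H", 3, "What time is the call?"),
]
print(unanswered_questions(sample, "H"))  # prints [3]
```

A missing reply is only a hint, of course: the answer may have come by phone or in person, so this would flag candidates for a human to look at rather than prove anything.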


IIRC "Exploring Everyday Things with R and Ruby" by Sau Sheong Chang does something at least similar with the Enron email database.


What a sad moment in US history. "What do you mean? Wipe it with a cloth?" Translation: I have nothing but contempt for the American people. Any one of us would have been in serious legal trouble for much less.

Nice code, BTW. How is pdftotext when compared to PyPDF2?


It is clearly an age of open contempt and disregard for the citizenry by a political class which sees itself as having a hereditary right to rule.


Think of it as the Petraeus doctrine ...


Are there any open source tools that would slurp in content like this and develop its own sense of relationships in the data, that I could then explore by hand?

Bedarra's Text Analyzer[1] kinda floored me and I'd like to use something similar for various tasks, if there was something good and free.

[1] http://www.bedarra.com/movies/textAnalyserMovie.html


> Are there any open source tools that would slurp in content like this ...

Yes, tesseract[1] can do a pretty good job. Here[2] is a blog post which describes using it to perform OCR on PDFs.

As for searching the PDF contents, Solr[3] might be what you are looking for instead.

1 - https://github.com/tesseract-ocr/tesseract

2 - http://fransdejonge.com/2012/04/ocr-text-in-pdf-with-tessera...

3 - http://stackoverflow.com/questions/6694327/indexing-pdf-with...


Why are the emails being released as PDFs?


This is standard procedure...the emails go through a vetting process to redact things that are not meant to be released to the public. This can mean anything from classified information to private information that was sent to an official (medical records, personal information in a forwarded application, etc). The PDF workflow is probably the most efficient way to do this...I don't mean efficient in terms of what programmers think of as efficient...I mean efficient in that the vetting/censoring process involves a good number of officials who are reviewing the documents, which probably involves some back-and-forth, for which paper is a pretty decent medium. Also, not all/many of these officials are versed in digital editing workflows [1]. I think in some situations, the emails are printed, physically redacted (i.e. black marker), then scanned and converted to PDF. Keep in mind that many requesters of government documents are perfectly fine with printable documents, even if such documents are not suited for machine parsing.

[1] Government officials have been burned before when using what they thought was a PDF redaction tool...i.e. using the box-drawing-tool and drawing a black box...was not actually redaction...Governor Blagojevich's case comes to mind: http://www.wbez.org/blog/update-blagojevich-lawyers-created-...


How is it not easier to programmatically find/replace specific names/emails and replace with (redacted)?

In this day and age it takes a lot more work to print, manually redact and scan than it does to digitally redact permanently. Your story about redacting PDFs is interesting, but these were not originally PDFs, they were emails.

I can't see any other reason for delivering documents as printed/scanned PDF than to make it more difficult to analyse, considering they were originally not paper or PDF. Occam's razor applies here.


> How is it not easier to programmatically find/replace specific names/emails and replace with (redacted)?

Because this isn't possible. How are you going to account for any possible misspellings, accidental spacing or other unknown unknowns? You can't. To properly disseminate things the documents have to run through actual people. Sure, people fuck up things too but they can be much more precise when it comes to processing real, human language than any computer right now.
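
To make the misspelling point concrete, here is a minimal sketch of the naive find/replace approach. The function name, the name being redacted, and the sample text are all made up for illustration; this is not how any agency actually redacts.

```python
import re

def naive_redact(text, names):
    """Replace exact, case-insensitive matches of each name with (redacted)."""
    for name in names:
        text = re.sub(re.escape(name), "(redacted)", text, flags=re.IGNORECASE)
    return text

# Made-up sample: one exact match, one misspelling, one stray space.
sample = "Call John Smith. Jon Smith agreed. J ohn Smith will follow up."
print(naive_redact(sample, ["John Smith"]))
# prints: Call (redacted). Jon Smith agreed. J ohn Smith will follow up.
```

Only the exact spelling is caught; the misspelled and mis-spaced variants pass straight through, which is the failure mode described above.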

This is the process the government takes. Convert to PDF, redact, release.

> I can't see any other reason for delivering documents as printed/scanned PDF than to make it more difficult to analyse, considering they were originally not paper or PDF. Occam's razor applies here.

You are misusing Occam's razor. This is actually the easiest, most correct way the government has available to disseminate possibly sensitive information. It sucks. If you can write something better then do it; you can make a ton of money and save the government a whole lot of time. But it better be really, really accurate.


> digitally redact permanently

How sure would you be of a tool that let you do this?

"Flattening" the document through a paper medium offers a benefit of reducing the digital trail to the tools used to convert from paper to PDF. Unintended data leaks are minimized.


You're preaching to the wrong person...I think the world would be a better place if everyone could grep/sed/awk their way through textfiles. That said, imagine the agency's situation: They have thousands of documents to review before releasing. Many/most of these emails are releasable as is. But some potentially contain classified information that cannot be known a priori...i.e. it's not as simple as doing a grep for "CLASSIFIED" or "OSAMA". In fact, one of the current controversies with the Clinton email is that they're finding information that should have been treated as classified, but at the time of sending, was not...the conditions for classification aren't often based off of keywords, but on the context and who the sender is.

Occam's Razor falls on the side of the print-to-paper-then-scan-to-PDFs. When you print an email or Word document to paper, you have effectively destroyed all non-visible metadata. When you use a Sharpie to rub out a name, you can be sure that mark is going to propagate to the electronic scan. You have none of those assurances if you work electronically...and imagine being someone who _isn't_ a programmer. Not too long ago the NSA was clusterfucked by a low-level employee who ran wget without anyone noticing. If you're the non-programming bureaucrat in charge of this bespoke redaction process, software seems very arcane and insecure...paper sounds pretty nice in comparison.

But yeah, I do wonder about the amount of human labor and late nights that go into this. But apparently, a lot of our legal system has revolved around legions of paralegals and lawyers sifting through uncountable volumes of paper...and a lot of the people doing the current redaction probably came from that field of work.


I meant Occam's razor as a way of explaining why they were handed over as paper - the simplest explanation being they were intentionally making it harder.

I get why the agency might print it out, but I don't get why the Clinton staffers would print them out. If the government is going to subpoena or raid me, they're just taking my computer. Sure, if it gets to public release, then they might print/redact. But they sure are not going to wait for me to print out stuff and hand it over.

Given the original 'two phones are too difficult' excuse for having the private emails, I can't see how printing out the paper and handing that over can be seen as anything but deliberate delaying. Unless I'm missing something here, it's not like the Clinton staff redacted the mails themselves - they just deleted the ones they didn't want to hand over.

If the agency staff are redacting for public release, then that is a different thing altogether, they could have done that from the original emails.


Maybe they did both.


Because Hillary wants to make it hard to analyze them. Drug companies used the same technique to hide their relationships with doctors.

https://www.propublica.org/nerds/item/doc-dollars-guides-col...


I was the original programmer for that project...while I didn't enjoy spending part of my late 20s learning about PDF to text conversions...I wouldn't necessarily attribute their use of PDF to devious motives. It's easy to forget sometimes that much of the world still operates on paper...and electronic documents that translate directly to paper have some value. These financial records were disclosed as part of legal settlements, and they were also used by hospital compliance officers...While obviously just sending out CSVs or Excel sheets would be great, it adds a possible breakage point between the company and its audience...i.e. hospital officers complaining that the spreadsheet looks funky on their version of Excel or whatever the hell they're using. The thing that makes PDFs so frustrating to parse is what makes them so "secure" and stable to end-users. That's why I advise my students to not just send URLs and web-clips to potential employers, but to have PDF versions of their resumes and work because inevitably, the "boss" wants a printout on his desk...and you don't want his office's shitty printer (and/or whatever IE browser rendered the HTML) to leave a bad first impression.

And let's face it, if you are an end-user who is not a programmer, a PDF is very convenient. You can even do Ctrl-F to look for things...and if you aren't a programmer, that's all the search-power you need to do your compliance work. After the investigation was published, I fielded a lot of questions from auditors who were convinced I had built some amazing proprietary search algorithm...such people just don't know the gap between doing Ctrl-F versus "SELECT * FROM doctors where LAST_NAME like ..." :)
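
For anyone curious what that gap looks like in practice, here is a small sketch using Python's built-in `sqlite3` module. The `doctors` table, its columns, and the rows are invented for the example; the actual project's schema is not described in this thread.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doctors (last_name TEXT, company TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO doctors VALUES (?, ?, ?)",
    [
        ("Smith", "Acme Pharma", 1200.0),
        ("Smithson", "Acme Pharma", 300.0),
        ("Jones", "Other Pharma", 50.0),
    ],
)

def find_by_last_name(conn, prefix):
    """Return rows whose last_name starts with `prefix`, via a parameterized LIKE."""
    return conn.execute(
        "SELECT last_name, company, amount FROM doctors "
        "WHERE last_name LIKE ? ORDER BY last_name",
        (prefix + "%",),
    ).fetchall()

print(find_by_last_name(conn, "Smi"))
```

Unlike Ctrl-F, the same query can be sorted, aggregated, or joined against other tables once the records are in a database.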

There were a couple of companies who arguably went out of their way to obfuscate their records. They had built websites that consisted of Flash widgets designed to look and act like a (crappy) HTML table...except with Flash, you can prevent the user from select/copy/etc. Unfortunately, the developers of those widgets never told the companies that their Flash containers operated by making XHR requests to external XML/JSON files...so these companies ended up being the easiest to gather the data from.


Handling PDFs and OCR is always troublesome.

But in this case, if I understand you correctly, the original source documents were PDF? That they were distributed as PDF for all the reasons you state?

That doesn't apply in this case - the original documents were not PDF, and had no reason to be PDF except to make it more difficult.


According to news reports Hillary handed over the emails to the Dept. of State as hard copies. These were then processed by them into PDFs including doing OCR.

http://money.cnn.com/2015/03/11/technology/security/hillary-...

Edit: Hilarious part was where State says this is not a problem because "it would have to print Clinton's emails in the normal review process."


She also handed over Outlook database files on a thumb drive to her lawyer, so it's hard to argue that this whole process was somehow necessary.


Fortunately, the PDFs already contain the plain text as metadata. I believe they are what are known as searchable image PDFs.

The code posted here isn't doing any OCR, but whatever generated the PDFs (Acrobat?) might have.
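
A minimal sketch of pulling out that embedded text without any OCR, assuming the `pdftotext` command-line tool (from Poppler or Xpdf) is installed. The function names are made up here, and whether the posted repo invokes `pdftotext` in exactly this way is an assumption.

```python
import shutil
import subprocess

def pdftotext_command(pdf_path):
    """Build a pdftotext invocation that writes the text layer to stdout."""
    # "-layout" preserves the physical layout; "-" sends output to stdout.
    return ["pdftotext", "-layout", pdf_path, "-"]

def extract_text(pdf_path):
    """Return the text already embedded in a PDF; no OCR is performed."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext is not installed")
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

If a PDF were a pure scan with no text layer, this would come back essentially empty, and an OCR step would be needed first.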


How did they get the plain text as metadata? Was the scanning equipment doing OCR and setting that?


I believe the U.S. agencies use ABBYY FineReader, which does a pretty good job with OCR and text resolution. The U.S. Senate used it when releasing the CIA torture docs a while ago:

http://www.nytimes.com/interactive/2014/12/09/world/cia-tort...


Sorry, I accidentally downvoted you when I meant to upvote you! Those arrows, ever so tiny...


It was quite a FU - I read originally they were delivered as printed emails.

Not good form at all.


She probably wants to delay the inevitable.


Awesome. I didn't know the raw data was available anywhere publicly. I thought only journalists/gov had it.


Not all of it is. Only small parts of it.


Has anyone created a web interface yet for browsing these emails?


I don't get it. She wasn't supposed to use a private email server because it could be compromised and leak classified information. And now they're releasing the emails to the public. How does this make any sense?


I haven't followed all the details of this saga, but my understanding is that this release is the result of a Freedom of Information Act (FOIA) request for the emails. The emails were vetted by the State Department, and only those that do not have security clearance restrictions are being released to the public.


Part of the controversy is the concern that classified information may not have been stored securely, but another concern is that Clinton may have deleted emails which legally must be preserved (the Federal Records Act).

Compare with the 2013 IRS controversy, which could not be properly investigated because Lois Lerner's hard drive had been erased.


The FOIA request excludes sensitive info, so that's not at risk here. However, had she not used a private server, no one would care.


Python 3!


OT:

"Why do we always refer to female public figures by their first name, and male ones by their last?"

I heard/read that question somewhere, and it stuck with me, can't stop seeing it. And this is a fine example of that.

Google results:

"Clinton email scandal" 41M results.

"Hillary email scandal" 48M results.

Yes, I do realize hers is a special case, and it's necessary to use her first name to disambiguate wrt Bill. But the phenomenon exists. Just something to keep in mind.

[Edit:

"In the tennis commentary, women athletes were called by only their first names 52.7% of the time, while men were referred to by only their first names 7.8% of the time."

http://www.la84.org/gender-stereotyping-in-televised-sports/

]


To be fair, there was another famous Clinton that could make this confusing. I've never heard of Condoleezza Rice, Sarah Palin or Angela Merkel being called by just their first name, although I admit that I don't often hear them being called by just their last name either (maybe except Merkel).


Condoleezza Rice was often referred to as "Condi" in the media. I think a lot of it has to do with information theory: there are two famous Clintons, several famous Rices, and hundreds of famous Sarahs.


I stand corrected on that one. I'd personally only ever heard her referred to by her full name, but that's just my own ignorance.


And that might be a German peculiarity.

Co-workers who have known each other for longer than they have known their spouses address each other as Herr Schulze and Frau Schmidt.


Because in general, women are less common, so we can use their common name without people wondering which person we're referring to?

I realize that's a weak justification, but it does make some difference... And 41M vs. 48M does not require a very strong justification.

Of course sometimes the issue is that people use first names in order to intentionally avoid showing respect, but that happens to "dubya" and "jeb" and many other public men as well.


Is stealing email becoming legal, no matter who it belongs to?



