This is standard procedure...the emails go through a vetting process to redact things that are not meant to be released to the public. This can mean anything from classified information to private information that was sent to an official (medical records, personal information in a forwarded application, etc.). The PDF workflow is probably the most efficient way to do this...I don't mean efficient in terms of what programmers think of as efficient...I mean efficient in that the vetting/censoring process involves a good number of officials reviewing the documents, which probably involves some back-and-forth, for which paper is a pretty decent medium. Also, not all (or even many) of these officials are versed in digital editing workflows [1]. I think in some situations, the emails are printed, physically redacted (i.e. black marker), then scanned and converted to PDF. Keep in mind that many requesters of government documents are perfectly fine with printable documents, even if such documents are not suited for machine parsing.
[1] Government officials have been burned before when using what they thought was a PDF redaction tool...i.e. using the box-drawing-tool and drawing a black box...was not actually redaction...Governor Blagojevich's case comes to mind: http://www.wbez.org/blog/update-blagojevich-lawyers-created-...
How is it not easier to programmatically find/replace specific names/emails and replace with (redacted)?
In this day and age it takes a lot more work to print, manually redact and scan than it does to digitally redact permanently. Your story about redacting PDFs is interesting, but these were not originally PDFs; they were emails.
I can't see any other reason for delivering documents as printed/scanned PDF than to make it more difficult to analyse, considering they were originally not paper or PDF. Occam's razor applies here.
> How is it not easier to programmatically find/replace specific names/emails and replace with (redacted)?
Because this isn't possible. How are you going to account for any possible misspellings, accidental spacing or other unknown unknowns? You can't. To properly disseminate things the documents have to run through actual people. Sure, people fuck up things too but they can be much more precise when it comes to processing real, human language than any computer right now.
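To make the misspelling/spacing point concrete, here is a minimal sketch in Python. The name, the sample lines, and the `naive_redact` helper are all made up for illustration; it only shows that an exact-match find/replace leaves variants untouched, and says nothing about how any agency's tooling actually works.

```python
import re

def naive_redact(text, names):
    """Replace exact (case-insensitive) occurrences of each name."""
    for name in names:
        text = re.sub(re.escape(name), "(redacted)", text, flags=re.IGNORECASE)
    return text

# Hypothetical sample lines: one clean, three with the kinds of slips people make.
samples = [
    "Please forward to Jane Doe today.",
    "Please forward to Jane  Doe today.",   # accidental double space
    "Please forward to Jane Deo today.",    # transposed letters
    "Please forward to J. Doe today.",      # abbreviation
]

redacted = [naive_redact(line, ["Jane Doe"]) for line in samples]
missed = [line for line in redacted if "(redacted)" not in line]

for line in redacted:
    print(line)
print(f"{len(missed)} of {len(samples)} lines slipped through")
```

You can keep adding patterns for each variant you think of, but the ones that hurt are the ones nobody thought of, which is why a human reviewer stays in the loop.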
This is the process the government takes. Convert to PDF, redact, release.
> I can't see any other reason for delivering documents as printed/scanned PDF than to make it more difficult to analyse, considering they were originally not paper or PDF. Occam's razor applies here.
You are misusing Occam's razor. This is actually the easiest, most correct way the government has available to disseminate possibly sensitive information. It sucks. If you can write something better, then do it; you can make a ton of money and save the government a whole lot of time. But it had better be really, really accurate.
How sure would you be of a tool that let you do this?
"Flattening" the document through a paper medium offers a benefit of reducing the digital trail to the tools used to convert from paper to PDF. Unintended data leaks are minimized.
You're preaching to the wrong person...I think the world would be a better place if everyone could grep/sed/awk their way through textfiles. That said, imagine the agency's situation: They have thousands of documents to review before releasing. Many/most of these emails are releasable as is. But some potentially contain classified information that cannot be known a priori...i.e. it's not as simple as doing a grep for "CLASSIFIED" or "OSAMA". In fact, one of the current controversies with the Clinton emails is that they're finding information that should have been treated as classified, but at the time of sending, was not...the conditions for classification often aren't based on keywords, but on the context and who the sender is.
Occam's Razor falls on the side of the print-to-paper-then-scan-to-PDFs. When you print an email or Word document to paper, you have effectively destroyed all non-visible metadata. When you use a Sharpie to rub out a name, you can be sure that mark is going to propagate to the electronic scan. You have none of those assurances if you work electronically...and imagine being someone who _isn't_ a programmer. Not too long ago the NSA was clusterfucked by a low-level employee who ran wget without anyone noticing. If you're the non-programming bureaucrat in charge of this bespoke redaction process, software seems very arcane and insecure...paper sounds pretty nice in comparison.
But yeah, I do wonder about the amount of human labor and late nights that go into this. But apparently, a lot of our legal system has revolved around legions of paralegals and lawyers sifting through uncountable volumes of paper...and a lot of the people doing the current redaction probably came from that field of work.
I meant Occam's razor as a way of explaining why they were handed over as paper - the simplest explanation being they were intentionally making it harder.
I get why the agency might print it out, but I don't get why the Clinton staffers would print them out. If the government is going to subpoena or raid me, they're just taking my computer. Sure, if it gets to public release, then they might print/redact. But they sure are not going to wait for me to print out stuff and hand it over.
Given the original 'two phones are too difficult' excuse for having the private emails, I can't see how printing out the paper and handing that over can be seen as anything but deliberate delaying. Unless I'm missing something here, it's not like the Clinton staff redacted the mails themselves - they just deleted the ones they didn't want to hand over.
If the agency staff are redacting for public release, then that is a different thing altogether, they could have done that from the original emails.
I was the original programmer for that project...while I didn't enjoy spending part of my late 20s learning about PDF to text conversions...I wouldn't necessarily attribute their use of PDF to devious motives. It's easy to forget sometimes that much of the world still operates on paper...and electronic documents that translate directly to paper have some value. These financial records were disclosed as part of legal settlements, and they were also used by hospital compliance officers...While obviously just sending out CSVs or Excel sheets would be great, it adds a possible breakage point between the company and its audience...i.e. hospital officers complaining that the spreadsheet looks funky on their version of Excel or whatever the hell they're using. The thing that makes PDFs so frustrating to parse is what makes them so "secure" and stable to end-users. That's why I advise my students to not just send URLs and web-clips to potential employers, but to have PDF versions of their resumes and work because inevitably, the "boss" wants a printout on his desk...and you don't want his office's shitty printer (and/or whatever IE browser rendered the HTML) to leave a bad first impression.
And let's face it, if you are an end-user who is not a programmer, a PDF is very convenient. You can even do Ctrl-F to look for things...and if you aren't a programmer, that's all the search-power you need to do your compliance work. After the investigation was published, I fielded a lot of questions from auditors who were convinced I had built some amazing proprietary search algorithm...such people just don't know the gap between doing Ctrl-F versus "SELECT * FROM doctors where LAST_NAME like ..." :)
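For anyone curious what that gap looks like in practice, here is a rough sketch using Python's built-in `sqlite3`. The `doctors` table, its columns, and every row are hypothetical stand-ins, not the project's actual schema or data.

```python
import sqlite3

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doctors (last_name TEXT, company TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO doctors VALUES (?, ?, ?)",
    [
        ("Smith", "Company A", 1200.0),
        ("Smithson", "Company B", 300.0),
        ("Jones", "Company A", 950.0),
        ("Smith", "Company B", 800.0),
    ],
)

# Roughly what Ctrl-F gives you: every row whose name starts with a string.
matches = conn.execute(
    "SELECT last_name, company, amount FROM doctors WHERE last_name LIKE ?",
    ("Smith%",),
).fetchall()

# What Ctrl-F cannot do: add up the matches per name across companies.
totals = conn.execute(
    "SELECT last_name, SUM(amount) FROM doctors "
    "WHERE last_name LIKE ? GROUP BY last_name ORDER BY last_name",
    ("Smith%",),
).fetchall()

print(matches)
print(totals)
```

The first query is the Ctrl-F experience; the second is the part that looks like magic if you have only ever searched inside a PDF.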
There were a couple of companies who arguably went out of their way to obfuscate their records. They had built websites that consisted of Flash widgets designed to look and act like a (crappy) HTML table...except with Flash, you can prevent the user from select/copy/etc. Unfortunately, the developers of those widgets never told the companies that their Flash containers operated by making XHR requests to external XML/JSON files...so these companies ended up being the easiest to gather the data from.
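Once you have found the data file a widget like that loads, the rest is a few lines. A minimal sketch, assuming a JSON payload shaped like the made-up one below (the field names and values are hypothetical, not what any of those companies actually served):

```python
import csv
import io
import json

# Hypothetical payload: stands in for whatever the widget's data file returned.
payload = (
    '{"rows": ['
    '{"name": "A. Example", "amount": "100.00"}, '
    '{"name": "B. Example", "amount": "250.50"}'
    ']}'
)

def payload_to_csv(raw):
    """Parse the JSON text and write its rows out as CSV text."""
    rows = json.loads(raw)["rows"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "amount"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

csv_text = payload_to_csv(payload)
print(csv_text)
```

Compare that with the PDF route, where you first have to recover rows and columns from positioned text before you can even get to this step.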
But in this case, if I understand you correctly, the original source documents were PDF? That they were distributed as PDF for all the reasons you state?
That doesn't apply in this case - the original documents were not PDF, and had no reason to be PDF except to make it more difficult.
According to news reports Hillary handed over the emails to the Dept. of State as hard copies. These were then processed by them into PDFs including doing OCR.
I believe U.S. agencies use ABBYY FineReader, which does a pretty good job with OCR and text resolution. The U.S. Senate used it when releasing the CIA torture docs a while ago: