
A Technical and Cultural Assessment of the Mueller Report PDF - mpweiher
https://www.pdfa.org/a-technical-and-cultural-assessment-of-the-mueller-report-pdf/
======
est31
I get why they did this.

Document formats have gotten so complicated that you have no idea whether the
redaction software you use actually does its job or doesn't. Going analog then
back gives you a very good guarantee that you can't get otherwise.

To guarantee that there are no leaks, the redaction software has to be
bug-free. Leaks can be anything, from how much free space there is between
allocated regions to highly precise layout placement information that you can
use to figure out censored words by trial and error if you have a copy of the
software that was used. So you can't really come up with a watertight formal
definition of leak-freedom, which makes proving that your software removes all
leaks impossible, at least in rich-text documents. The only way I see is to go
full ASCII or something.

~~~
nabla9
I wonder how different it is to carry a USB stick from SCIF to SCIF compared
to just moving paper.

I doubt that top secret counterintelligence information in the report can be
redacted in normal office space. Installed software in a SCIF may be highly
limited and out of date.

~~~
MrMorden
SCIFs generally have some sort of TS/SCI network connectivity, so the
appropriate solution would be to just use it. But every agency that has SCIFs
wants their own network, because it would be disastrous if TLA #1 could see
TLA #2's cafeteria menus. Congress, the White House, all seventeen IC
agencies, and every customer agency have at least one; and people on one don't
necessarily have the access they need to others (or the cafeteria menus would
be visible, and we can't have that). Given sufficiently pathological
connectivity, it can be easier to just have someone courier a DVD.

There's absolutely no legitimate reason for a computer in a SCIF to have
outdated software. Data diodes exist, and there is no technical obstacle to
setting up a mirror of whatever package repository you like. (Political
obstacles may be non-trivial, because compliance is far more important than
security, and too many members of upper management believe that classified
networks are somehow magically secure.)

------
raphlinus
The gold standard for redaction is to replace the redacted text with nonsense
of similar length. Otherwise, you retain precise metrics for the redacted
text. With guesses or context you can fill those in with high probability. In
an extreme case, the redaction might be a name of let's say a member of
Congress, so the candidates can be narrowed down to a tiny number. There's
considerable work on this, notably the "Declassification Engine" [0]. I
believe it would be possible to apply even better word models (such as GPT-2)
to improve the results even more.
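
To make the metrics point concrete, here is a minimal sketch of narrowing
candidates by rendered width. Everything in it is an assumption for
illustration: the per-character widths are made up (real values would come from
the font's metrics), and `char_width`, `text_width` and `matching_candidates`
are hypothetical helper names, not part of any real tool.

```python
def char_width(ch: str) -> int:
    """Made-up advance width for one character, in 1/1000 em (illustrative only)."""
    if ch in "iljtf. ":
        return 278
    if ch in "mwMW":
        return 833
    if ch.isupper():
        return 667
    return 556


def text_width(text: str) -> int:
    """Total rendered width of a string under the toy metrics above."""
    return sum(char_width(ch) for ch in text)


def matching_candidates(candidates, redaction_width, tolerance=50):
    """Keep only the candidates whose width is within tolerance of the black bar."""
    return [
        name
        for name in candidates
        if abs(text_width(name) - redaction_width) <= tolerance
    ]


# Hypothetical candidate names; the bar width is what an attacker would measure.
candidates = ["Ann Lee", "Bob Stone", "Al Li"]
bar_width = text_width("Bob Stone")
print(matching_candidates(candidates, bar_width))
```

With a small candidate pool (say, members of Congress) and precise bar widths,
a filter like this can leave very few names standing, which is why replacing
the text with nonsense of only roughly similar length is safer.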

I'm interested in whether this document was redacted in such a military-secure
way, or whether black bars were simply placed over the text. I've reached out
to a couple of news organizations offering my consulting help, but didn't get
a nibble.

[0]: https://www.newyorker.com/tech/annals-of-technology/the-declassification-engine-reading-between-the-black-bars

~~~
haasted
> With guesses or context you can fill those in with high probability. In an
> extreme case, the redaction might be a name of let's say a member of
> Congress

Something like this actually happened somewhere in the report. A name was at
the end of a line, causing part of it to break onto the next line. The few
black characters present on the second line led readers to theorise that the
name ends with “jr”.

~~~
woozyolliew
Maybe Mueller figured out who killed JR?

------
sverhagen
What a beautifully geeky article.

I was wondering if they'd also want to print-then-scan the document out of a
fear that redactions can be undone somehow? I remember stories about older
revisions of documents still being retained in the metadata (or other not
immediately visible data) of documents, at the very least documents from
Microsoft Office.

~~~
garmaine
Yes, that's basically it. (Source: I've previously worked for the government.)

------
deno
One of the major shortcomings of this simple redaction method, not mentioned
in the article, is a security one.

This is less of an issue with the Mueller report but was quite noticeable with
the Snowden leaks. If your redactions cover short words or groups of words,
you can sometimes make pretty good guesses as to what was redacted. There are
all kinds of methods you can employ, statistical etc., to aid with this.

A proper redaction tool could avoid this by varying the length of the
redactions. PDFs even have layout hints now, so it’s not that complicated
technically.

The copy-and-scan security method can be replicated by a flattening pass that
does the same thing without losing any of the accessibility benefits.

~~~
blihp
The problem with that approach is that varying the length of the redacted text
would effectively alter the document and forever muddy the waters as to what
the original text could be. This would likely result in endless speculation
that what might be released at a later date is not, in fact, the original
text. It also makes it much more difficult to assess whether or not fighting
to get one or more parts of redacted text released is worthwhile. (i.e. what
appears to be a name or phrase in one area might not be deemed important vs a
page and a half in another or vice versa)

So let's say Congress takes this to court and gets an order requiring the A.G.
to disclose the contents of one or more sections/types of redactions. There
would likely be little confidence that any future unredacted text they receive
was the original text rather than yet another modified version of it (i.e.
maybe one or more words would be added/removed etc.) By disclosing the exact
length of the text, as imperfect a solution as that is, it makes it likely
that any future alterations can be more easily detected.

------
jakecopp
> In releasing the redacted PDF of the report to the public, Barr avoids
> suspicion that the document had been edited (changed) in addition to
> straightforward redactions. PDF serves the need to unambiguously assure the
> press and the public that they are seeing Mueller's actual report.

This is such a shame. The PDF wasn't even signed.

Is an HTML file with a hash provided directly by Mueller/whoever is doing the
redacting too much to ask?

> There's really no model for redaction of HTML-based web content.

Is replacing censored text with a fixed number of some character not good
enough?
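
Roughly what I have in mind, as a minimal sketch: swap each secret for a
fixed-width filler so the length leaks nothing, then publish a SHA-256 of the
result. The names `redact` and `sha256_hex` and the sample sentence are made up
for illustration. Note that the hash only shows the file wasn't changed after
it was hashed; it says nothing about whether the redaction itself was honest.

```python
import hashlib


def redact(text: str, secrets: list[str], fill: str = "X", width: int = 8) -> str:
    """Replace every secret with the same fixed-width filler, hiding its length."""
    for secret in secrets:
        text = text.replace(secret, fill * width)
    return text


def sha256_hex(text: str) -> str:
    """Hex digest a publisher could post alongside the redacted file."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Hypothetical sample text and secrets.
original = "Alice met Bob."
public = redact(original, ["Alice", "Bob"])
print(public)
print(sha256_hex(public))
```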

~~~
jstanley
With either a PDF file or an HTML file, there's no way to prove that the
document is a redacted version of the original, rather than an edited-and-
redacted version of the original. I also don't see what using PDF adds here.

~~~
Mirioron
I think it's because many people think you can't edit PDFs, but they know you
can edit text documents.

~~~
gruez
Even if you really can't edit PDFs, there's nothing preventing you from
recreating the document and exporting that as a pdf. After all, it's all text.

------
Klasiaster
A proper standard for redacted documents would be a format where the author
provides a list of ((text, position), signature) entries and the redactor is
free to remove elements of this list when redistributing it. (edit: to show
the deletions, the author would have to provide a signed background shape as
well where the text entries are cut out already)
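
A minimal sketch of the list idea, with loud caveats: it uses an HMAC from the
Python standard library as a stand-in for a real signature, which means the
verifier here needs the author's key. An actual format would presumably use an
asymmetric scheme (e.g. Ed25519) so anyone can verify with a public key. The
key, the function names and the sample entries are all hypothetical, and the
signed background shape is left out.

```python
import hashlib
import hmac

# Hypothetical demo key; a real scheme would use an asymmetric key pair instead.
AUTHOR_KEY = b"demo-key-not-secret"


def sign_entry(text, position, key=AUTHOR_KEY):
    """Tag one (text, position) entry; the position is covered by the tag too."""
    message = f"{position[0]},{position[1]}:{text}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def author_publish(entries):
    """Author side: produce the list of ((text, position), signature) entries."""
    return [((text, pos), sign_entry(text, pos)) for text, pos in entries]


def redact(signed_entries, indexes_to_remove):
    """Redactor side: drop entries without touching the ones that remain."""
    return [e for i, e in enumerate(signed_entries) if i not in indexes_to_remove]


def verify(signed_entries, key=AUTHOR_KEY):
    """Reader side: every remaining entry must still carry a valid tag."""
    return all(
        hmac.compare_digest(sig, sign_entry(text, pos, key))
        for (text, pos), sig in signed_entries
    )


# Hypothetical three-word document; the redactor removes the middle word.
document = author_publish([("The", (0, 0)), ("witness", (40, 0)), ("lied", (110, 0))])
redacted = redact(document, {1})
print(verify(redacted))
```

Removing entries keeps the rest verifiable, while changing the text or the
position of a surviving entry makes verification fail.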

------
hyperman1
Printers add small invisible yellow dots to pages to make them traceable. Is
this a viable way to find out where these pages were printed?

~~~
est31
Most of them are probably lost due to the heavy compression applied... but
extracting yellow dots from scanned documents is possible in theory. E.g. it
was most likely the reason why Reality Winner got arrested so quickly after
she gave NSA documents to the press:
https://en.wikipedia.org/wiki/Reality_Winner

------
dmclamb
Seems AI could be used to reconstruct a portion of the redacted sections of
the document. There are already viable human translations proposed in this
thread. . .

