
Mark Felt-Tipped: Uncovering top-secret information by counting pixels - matt4077
https://matthi.coffee/2018/mark-felt-tipped/
======
aaronbrethorst
The title of the post is a pun: Watergate's "Deep Throat" was a man named Mark
Felt, who was the Associate Director of the FBI. He felt jilted over his lack
of promotion to the Directorship, and responded by leaking to Woodward and
Bernstein. [https://www.vanityfair.com/news/2013/11/watergate-leak-
mark-...](https://www.vanityfair.com/news/2013/11/watergate-leak-mark-felt-
career-ladder)

~~~
astrodust
Disappointed the Marker Felt font didn't make a cameo.

------
guidovranken
It's not different for encrypted data that obscures the content but not its
size. I've written before about how you can sometimes infer the size of a
password in a HTML form submitted over an otherwise fully secure, correctly
implemented SSL (HTTPS) connection. I've also shown how you can guess the
general location on the earth encoded in GPS coordinates from the size of the
textual representation of those coordinates alone. It's not rocket science to
make this deduction, but it is, I think, nonetheless an under-appreciated
aspect of consumer-grade encryption. See here if you want to read my findings
[https://guidovranken.files.wordpress.com/2015/12/https-
bicyc...](https://guidovranken.files.wordpress.com/2015/12/https-bicycle-
attack.pdf)

~~~
dspillett
> I've written before about how you can sometimes infer the size of a password
> in a HTML form submitted over an otherwise fully secure, correctly
> implemented SSL (HTTPS) connection.

Good catch. Interesting, but obvious once you think about it. Not really a
danger for properly chosen passwords/passphrases (the search space would still
be too large for brute forcing) but it would highlight where a brute force
attempt is worth trying.

I might have to add an extra (hidden) input to all our authentication pages,
programmatically filled with random characters up to a length of 1024 minus
the length of the entered password.
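
As a rough illustration of that arithmetic (a sketch, not dspillett's actual implementation), the filler could be computed as below. In practice it would have to be generated in the browser before the form is submitted; Python is used here only to show the idea. The names `TARGET_LEN`, `ALPHABET` and `make_filler` are hypothetical, and the count is in characters, so non-ASCII passwords or URL-encoding could still change the byte length of the request.

```python
import secrets
import string

TARGET_LEN = 1024  # password + filler length, the figure from the comment
ALPHABET = string.ascii_letters + string.digits  # assumed filler alphabet


def make_filler(password: str, target_len: int = TARGET_LEN) -> str:
    """Return random filler so len(password) + len(filler) == target_len."""
    missing = target_len - len(password)
    if missing < 0:
        raise ValueError("password is longer than the padding target")
    # secrets.choice draws from a cryptographically strong random source.
    return "".join(secrets.choice(ALPHABET) for _ in range(missing))
```

With this, the password field plus the hidden field always add up to 1024 characters, whatever the user typed.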

~~~
Terr_
For extra paranoia, pad to a _byte_ limit so that UTF-8 doesn't leak anything
from its fluctuating length.

IMO if you don't _need_ the entropy, it's probably easier to pad with some
kind of "clearly not legal" character which is still visible when debugging,
such as newlines.

~~~
dspillett
_> IMO if you don't need the entropy_

You almost certainly don't, but my default for any security related value is
"properly random" except where the randomness itself might give clues. That
way around is less likely to result in a "kicking oneself" situation in the
future!

~~~
Terr_
The reason I'm suggesting a detectable padding is for a different "kicking
yourself" scenario: What happens if some system ever _doesn't_ strip the
junk? Now you've got corrupt data, and it's been corrupted in a way that you
cannot reliably fix.

In the case of hashed passwords, suppose the user entered "foo" but you stored
hash("foo42156"). Now the user is locked out, and there's no way for you to
fix it on their next login attempt, because you have no way of knowing how
much of it was "real" anymore.

In contrast, a deterministic system (like "pad with newlines to 256 bytes")
allows you to take their next login attempt, validate it under the _older_
method, and "upgrade" the hash to the correct de-junked version.

It's not just passwords either: The issue of corruption also applies to all
variable-length _non-hashed_ sensitive data you might apply this scheme to.
For example, security-questions, e-mail addresses, financial account numbers,
etc.
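
A minimal sketch of the deterministic variant, assuming newline padding to 256 bytes as in the example above. The helper names `pad` and `unpad` are made up for illustration, and rejecting values that already contain a newline is an added assumption that keeps stripping unambiguous.

```python
PAD_BYTE = b"\n"  # "clearly not legal" filler that is easy to spot when debugging
PAD_TO = 256      # byte limit, so UTF-8 length differences do not leak


def pad(value: str, size: int = PAD_TO) -> bytes:
    """Encode as UTF-8 and pad with newlines to exactly `size` bytes."""
    raw = value.encode("utf-8")
    if PAD_BYTE in raw:
        raise ValueError("value already contains the padding byte")
    if len(raw) > size:
        raise ValueError("value exceeds the padded size")
    return raw + PAD_BYTE * (size - len(raw))


def unpad(padded: bytes) -> str:
    """Strip the trailing newline padding and decode back to text."""
    return padded.rstrip(PAD_BYTE).decode("utf-8")
```

Because the padding is predictable, a system that forgot to strip it can still recover the real value later with `unpad`, which is not possible with random filler.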

~~~
dspillett
Valid point.

In this case though the extra padding would be in a separate field that only
exists to control the length of the POST request body - nothing should be
looking at it in a way that would allow it to corrupt other data.

~~~
Terr_
Ah, I was interpreting it as randomness/padding on a per-field basis, e.g.:

password = "foo2556019562042" # Example limit of 16 chars

password_real_len = 3

------
cypherpunks01
The Washington Post actually ran with this redaction analysis, in "What we
learned from the Democratic response to the Nunes memo — and what we didn’t"
posted a couple hours ago:

Article:
[https://www.washingtonpost.com/news/politics/wp/2018/02/25/w...](https://www.washingtonpost.com/news/politics/wp/2018/02/25/what-
we-learned-from-the-democratic-response-to-the-nunes-memo-and-what-we-didnt/)

Handy GIF:
[https://img.washingtonpost.com/pbox.php?url=https://www.wash...](https://img.washingtonpost.com/pbox.php?url=https://www.washingtonpost.com/news/politics/wp-
content/uploads/sites/11/2018/02/Redact.gif&op=noop)

"By September 2016, the FBI had opened investigations into four members of
Trump’s campaign team. The Democratic memo says the information compiled by
Steele into his infamous “dossier” of 17 raw intelligence reports didn’t get
to the FBI’s counterintelligence team until the middle of September. By that
point, we can conclude thanks to a sloppy redaction (noted by former
intelligence officer Matt Tait) and an unredacted footnote that Page,
Papadopoulos, former Trump campaign chairman Paul Manafort and Michael Flynn,
who would go on to be Trump’s national security adviser, were all already
under investigation."

Matt Tait links to:
[https://twitter.com/pwnallthethings/status/96752319618133606...](https://twitter.com/pwnallthethings/status/967523196181336064)

~~~
zaroth
The FBI actually first interviewed Steele in _July_ 2016 about 25 days before
they opened the investigation.

“Simpson said Steele first shared his concerns with the FBI during the first
week of July 2016 and in a subsequent meeting with the Rome official two
months later when Steele provided the official ‘a full briefing’ of his
findings”

[https://www.usatoday.com/story/news/politics/2018/01/09/doss...](https://www.usatoday.com/story/news/politics/2018/01/09/dossier-
author-told-fbi-had-source-inside-trump-organization/1017938001/)

~~~
flatline
This would appear to have been refuted by the latest release:

[https://www.politico.com/story/2018/02/24/democratic-memo-
go...](https://www.politico.com/story/2018/02/24/democratic-memo-gop-fbi-
trump-campaign-423446)

~~~
zaroth
"The dossier, compiled by former British spy Christopher Steele, wasn't
provided to the FBI's counterintelligence team until mid-September 2016,
according to the memo."

This statement could be perfectly true, while it's also perfectly true Steele
met with the FBI in July, and had multiple other channels to provide
information to various other FBI departments. It doesn't particularly matter
exactly how the investigation started. If you've read the dossier and now
knowing what we know about how it came to be, it's pretty disgusting.

"FBI officials indicated that Steele himself was not advised that the work he
was doing was on behalf of the Clinton campaign."

Now this is something I hadn't heard before! That is absolutely shocking that
we're supposed to believe this ex-Spy was in the dark about who was paying
him?

~~~
flatline
IIRC the genesis of the dossier was opposition research by other Republicans.
We’re talking about intelligence operatives, they probably never actually know
where the money is coming from, and it often flows from different and
sometimes competing sources. These organizations run on secrecy and distrust.

------
cantrevealname
Here on Hacker News four years ago we figured out the redacted name of a
country in a similar way. The Intercept[1] reported that the National Security
Agency was secretly recording the audio of every phone call in the Bahamas and
in one other unnamed country. Looking at the source document, we figured out
that it was Afghanistan[2] based on the length of the blacked-out area and the
fact that it couldn't word wrap to a second line.

[1]
[https://firstlook.org/theintercept/article/2014/05/19/data-p...](https://firstlook.org/theintercept/article/2014/05/19/data-
pirates-caribbean-nsa-recording-every-cell-phone-call-bahamas/)

[2]
[https://news.ycombinator.com/item?id=7768839](https://news.ycombinator.com/item?id=7768839)

------
rafael859
Tom Murphy did the same thing yesterday (on the same part of the text no
less), but he got five instead of four! [1]

[1]
[https://twitter.com/tom7/status/967568358861430785](https://twitter.com/tom7/status/967568358861430785)

~~~
mattnewton
I trust four more, because his alignment seems to be off in the beginning, and
because the space that follows is too short for descriptions of 5 people
assuming that it follows the same format as Carter Page (association then full
name).

~~~
lifeformed
Plus there is that little nub that sticks out on the right side of the
redaction, where the end of the "r" in "four" fits in perfectly.

~~~
Bartweiss
I was surprised that wasn't mentioned; in addition to spacing issues it looks
like the redaction literally just stopped too soon.

------
Bucephalus355
Also a good time to mention “Van Eck Radiation”. This is something all CRT
screens, and to a lesser extent, LCDs, emit. If you pick up this radiation and
know the model of the screen being used, you essentially have access to a live
visual of a person’s monitor.

Also worth mentioning that just like the Secret Service has an ink database on
all the printer types in the world, the NSA is supposed to have a database of
what different keyboards sound like. This means that simply by recording the
sound of you typing, they can infer keystrokes / characters. Obviously the
easiest way to record this is by hacking your phone, which is right next to
you.

[https://en.m.wikipedia.org/wiki/Van_Eck_phreaking](https://en.m.wikipedia.org/wiki/Van_Eck_phreaking)

~~~
unimpressive
Actually, the exploit with the keyboards is both more interesting and more
sophisticated than that. As described in _Silence On The Wire_ , essentially
how it works is that English letters are not randomly distributed. If you hear
any given keystroke, you know that the most likely letter to be pressed
knowing nothing else is the letter 'e'. This on its own isn't very helpful,
but you don't type every letter on your keyboard at the same speed. It takes
you ever so slightly longer to type 'z' than 'f'. You of course also only type
one key at a time. As a consequence, it's possible to follow a procedure
something like this to recover English text given audio of it being typed on a
standard keyboard:

- Assign a prior probability of letter frequencies based on a corpus of the
language text you're analyzing.

- Separate the different keystrokes in the audio file into a series of times
between keystrokes. (i.e., have your program recognize one keystroke as
distinct from another)

- Based on the subtle timing differences between keystrokes, assign various
lengths to different timings.

- Using the prior probability of the letter frequencies, assign the different
time lengths to different characters based on their frequencies.

- You now have a straightforward mapping between the distance between two
keystrokes and the character typed, which should allow you to decode the typed
text.

Further calibration can probably be had by considering a word dictionary and
using fuzzy matching to detect how often words are decoded incorrectly and
what the correct decoding would be.
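
A toy sketch of the steps as listed above, not the method from the book and nowhere near a working attack. It assumes the gaps between keystrokes have already been extracted from the audio, buckets them, and maps the most common bucket to the most common letter. `ENGLISH_BY_FREQ` is an approximate frequency ordering, and `decode_intervals` and `bucket_ms` are hypothetical names.

```python
from collections import Counter

# Approximate ordering of English letters from most to least frequent (assumption).
ENGLISH_BY_FREQ = "etaoinshrdlu"


def decode_intervals(intervals_ms, bucket_ms=10):
    """Map inter-keystroke gaps to letters purely by frequency rank."""
    # Quantise each gap so that near-identical timings share a bucket.
    buckets = [round(t / bucket_ms) for t in intervals_ms]
    # Rank buckets from most to least common.
    ranked = [bucket for bucket, _ in Counter(buckets).most_common()]
    # Pair the n-th most common bucket with the n-th most common letter.
    mapping = {
        bucket: ENGLISH_BY_FREQ[rank] if rank < len(ENGLISH_BY_FREQ) else "?"
        for rank, bucket in enumerate(ranked)
    }
    return "".join(mapping[b] for b in buckets)
```

On real recordings the timing distributions overlap heavily, which is where the dictionary-based correction mentioned above would come in.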

~~~
smrq
That's absolutely fascinating, not to mention a stronger attack. I use a self-
designed and manufactured keyboard, so I'd definitely be immune to a database
of various keyboard designs--mine is one-of-a-kind. But I doubt that the
differences in key layout would be sufficient to thwart the timing attack.

Of course, if you know or suspect that you are under such surveillance, you
could try and alter your typing cadence, e.g. by switching to hunt-and-peck.

~~~
ghettoimp
Perhaps some good music is the perfect defense. It would obscure the sound of
the keyboard, and also make it easy to type with the beat. :)

------
mlady
Using this methodology more generally, it would be interesting to use NLP to
identify the part of speech that is redacted to narrow the word search space.

In this example, the POS would be an adjective, and since the subject noun is
plural, it would be more likely the adjective is a number

~~~
abpavel
Or better yet, rank the resulting sentences for each of the possible fits
using existing speech API, setting a cutoff to filter out nonsensical results.
This might even yield surprises.

~~~
FLUX-YOU
You can't really predict if there's going to be an aside in the redacted
sentence though.

------
nabla9
Consider the opposite problem: What if you want to create document that
retains redacted information.

Some kind of Reed–Solomon type encoding in the typography that would allow
retrieving the whole document even after it has been redacted and copied.

~~~
jobigoud
You could use steganography to store the redacted parts.

~~~
mulmen
Wouldn't that change the non-redacted parts of the document? Or would there be
a cover page with a story about a boy and his dog on every FOIA request?

------
pja
Note to self: print all documents in mono-spaced Courier New if I intend to
ever redact any contents.

~~~
pavel_lishin
But then you'd be giving away the exact length of the redacted text. In this
case, you'd be narrowing the options down to "four", "five", and "nine" -
reasonable, but longer text blocks would be more susceptible to analysis.
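
The narrowing is easy to reproduce: with a monospaced font the redaction reveals the character count, so the candidates are just the words of that length. This sketch assumes the redacted word is a spelled-out number from one to nine; `NUMBER_WORDS` and `same_length_candidates` are illustrative names.

```python
NUMBER_WORDS = ["one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine"]


def same_length_candidates(n_chars, words=NUMBER_WORDS):
    """Return the words whose character count matches the redacted span."""
    return [w for w in words if len(w) == n_chars]


print(same_length_candidates(4))  # ['four', 'five', 'nine']
```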

~~~
autokad
thats a good point, what about random fonts per character? sucks when you get
a wingding, but hey

~~~
craftyguy
> sucks when you get a wingding, but hey

redaction by wingding could be a feature

------
eschutte2
In the example shown, it's obvious it's four, since you can see the right edge
of the letter sticking out past the black mark.

~~~
mhb
Or n6, nn6, n7 or nn7.

------
cypherpunks01
Looks like a job for dynamic programming!

It's a classic packing problem - searching for phrases of exact pixel width,
where each of the unknown number of letters contributes a different number of
pixels.
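
A minimal sketch of that dynamic program, assuming each letter has a fixed integer pixel width and ignoring kerning entirely. It counts how many letter sequences add up to an exact width. The function name and the widths in the usage note are made up for illustration.

```python
from typing import Mapping


def count_strings_with_width(widths: Mapping[str, int], target: int) -> int:
    """Count ordered letter sequences whose widths sum to exactly `target`.

    Assumes every width is a positive integer number of pixels.
    """
    ways = [0] * (target + 1)
    ways[0] = 1  # the empty string has width 0
    for total in range(1, target + 1):
        # The last letter can be any letter that still fits in `total`.
        ways[total] = sum(ways[total - px] for px in widths.values() if px <= total)
    return ways[target]
```

For example, with hypothetical widths `{"i": 3, "l": 3, "m": 8}` a 6-pixel gap has 4 fillings (ii, il, li, ll). The table costs roughly target width times alphabet size to fill, so counting is cheap; it is the number of fillings that grows quickly.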

~~~
senatorobama
Is it even tractable?

------
weaksauce
Footnote seven in the next paragraph lists the four people anyway and also un-
redacts that redaction.

~~~
cypherpunks01
Any sense of the redacted word in that next paragraph?

~~~
weaksauce
tough to say but the most likely answer is that it's either four with extra
space redacted or the word "several" or maybe "seven" as that would be 3 more
and not a stretch to imagine there are 3 more in this admin that would have
been under investigation?

------
Moodles
This reminds me of the New York times redaction fail:
[https://www.techdirt.com/articles/20140128/08542126021/new-y...](https://www.techdirt.com/articles/20140128/08542126021/new-
york-times-suffers-redaction-failure-exposes-name-nsa-agent-targeted-network-
uploaded-pdf.shtml)

~~~
mattnewton
Someone else tried that (posted further down)
[https://twitter.com/tom7/status/967568358861430785](https://twitter.com/tom7/status/967568358861430785)

I think it is also unlikely given the space that follows too which would read
more naturally if it was name + title, but it could be if the page was just
off too much (I think the article here does a better job aligning than this
guy did from looking at the beginning of the sentence.)

------
philip1209
If it is felt pen-redacted and then scanned, wouldn't there be some detectable
difference between the parts of the paper with and without underlying ink?

~~~
anfilt
Yes, if you have the original. However, scans usually adjust the contrast so
much that the saved image is not grayscale or colour. It's either black or white.

------
mc32
Why a numeral exactly and not an undefined quantity like "some"? Is there
additional context to eliminate the possibility of imprecise quantifiers?

~~~
netsharc
Here's what "four" and "some" look like:
[https://i.imgur.com/UNSpv6v.png](https://i.imgur.com/UNSpv6v.png) . Different
letter widths and kerning make this identification possible.

I wonder how brute-forceable it is. Given a blank with length of half a line,
can I just throw 20-30 combinations of the alphabet and punctuation, and figure
out which combinations match, and the anagram for it?

Actually, brute-forcing using words from a dictionary (English words plus
names of the involved people (Americans, and obviously Russians)) would make
it go even faster.
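
A sketch of the dictionary approach, assuming a table of per-character advance widths measured from the document's font. The table in the usage note is invented; real values would have to come from the actual font, and kerning is ignored here. `text_width` and `matching_words` are hypothetical helpers.

```python
def text_width(text, advance):
    """Sum the per-character advance widths (no kerning)."""
    return sum(advance[ch] for ch in text)


def matching_words(words, advance, target_px, tolerance_px=1):
    """Keep the words whose rendered width is within tolerance of the gap."""
    return [
        w for w in words
        if abs(text_width(w, advance) - target_px) <= tolerance_px
    ]
```

With invented widths `{"f": 5, "o": 8, "u": 8, "r": 5, "s": 6, "m": 12, "e": 7}`, "four" measures 26 and "some" measures 33, so a 26-pixel gap keeps only "four".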

~~~
OkGoDoIt
In theory kerning differences would make letter ordering significant, so this
becomes a permutation problem not a combination problem, and therefore much
harder to brute force.
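
To see why order matters, here is a small sketch of a kerned width calculation. It assumes kerning is a simple per-pair adjustment added on top of the advance widths; the pair table and numbers in the usage note are hypothetical.

```python
def kerned_width(text, advance, kerning):
    """Advance widths plus a kerning adjustment for each adjacent pair."""
    total = sum(advance[ch] for ch in text)
    # zip(text, text[1:]) walks over neighbouring character pairs in order.
    total += sum(kerning.get(pair, 0) for pair in zip(text, text[1:]))
    return total
```

With `advance = {"A": 10, "V": 10}` and `kerning = {("A", "V"): -2}`, "AV" measures 18 while "VA" measures 20, so the same letters in a different order no longer produce the same width.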

~~~
alkonaut
Is kerning used also across spaces? I.e is the space between two words
dependent on which letter the first word ends with, and which letter the
second word starts with?

If not, it still seems tractable using dictionary words.

~~~
blattimwind
> Is kerning used also across spaces?

Yes.

~~~
alkonaut
Ok. Makes it much more difficult. Still, assuming grammatically valid
sequences of dictionary words, kerning would be known both within and between
words. These texts however probably contain lots of abbreviations, footnote
symbols, numbers, brackets etc, that make it likely to be a lot harder than
just regular prose from dictionary words.

------
arca_vorago
Just FYI: the government became aware of this issue around 2010 and changed
the policy on how redactions were done, switching from blackout to whitebox
with black edges that were overlaid on the text with small size variations to
prevent the attack. Old documents though that have already been released are
likely to be ripe for this kind of analysis.

~~~
onychomys
This one was released over the weekend, though, so apparently somebody didn't
get the memo.

~~~
Bartweiss
I'm guessing Congressional memo redactions, especially these days, aren't
quite up to the standards of intelligence agencies that have to redact and
release for FOIA on a daily basis.

------
mhb
Why were digits ruled out? Like "100"?

~~~
mlady
maybe there's some standard style guide that rules that out?

I would be interested in seeing how many "matches" digits would yield

~~~
astura
I don't know if it's an official style somewhere or not, but I was always
taught 1-9 you spell out (one, two, three,...) And 10+ you write digits (10,
11, 12, 13,...). With a few exceptions, like if two numbers are next to each
other.

------
crafty
Interesting, but operating systems & graphics applications conform typefaces
to different subpixel grids. 17pt Times in Pixelmator on a Mac is not 17pt
Times on Photoshop on Windows, and certainly not 17pt Times on a scanned
document printed on an undisclosed substrate by an unknown printer. Similarly,
typeface tracking adds a significant margin of error to this interpretation
method. Metrics can vary by application and some applications will
automatically provide optical tracking adjustment for a better aesthetic. It's
for these reasons that I'm convinced that anything other than trivial (in
terms of LOE) & obvious disclosures won't be made using this methodology, and
we can't universally trust the results.

~~~
lokopodium
They achieved 100% match on the unredacted text with Word. Empirically
speaking, they've proven their software generates the same output as FBI's.

------
mikeash
Seems like if you truly care about hiding the redacted information, the only
choice is to manually retype the document with the redacted text replaced with
something that has no information content, like the literal text “REDACTED.”

Why don’t they do this?

~~~
skygazer
Probably to maintain the sense of boundedness of the withheld.

How big is the secret they're hiding? Is it of unlimited or limited length?
Your technique would obscure the sense of scale.

~~~
mikeash
You could add multiple instances of “REDACTED” to show approximately how much
stuff is missing, if that's important.

------
ixtli
I wonder whether or not there should be a rule on HN about linking to
classified information. I am an extremely harsh critic of the American
security state and the IC in general, but lower level employees can be fired
or even charged with crimes for viewing things they're not supposed to and I
don't think it's right to disregard the effect on the individual. Obviously, in
this case, it's only a single word, but I've seen leaked docs here before.

~~~
sneak
If you’re extremely harsh on the IC in general, then you wouldn’t be saying
this. Those people chose to get those clearances, and voluntarily agreed to
ridiculous and insane terms like “reading things other people say I can’t is
now grounds to put me in jail”.

Such people do not deserve special workarounds by the rest of society because
of their terribly poor judgement.

~~~
willstrafach
Reading classified things will not be grounds to put anyone in jail. Not sure
where you got that.

~~~
ixtli
I have some old friends that work in the American intelligence community (CIA,
NRO, etc.) and they've explained to me that you consent to a lot of things
when you join the military and get clearance that don't apply to civilians. I
asked why the Army/Navy completely block access to wikileaks, and I was told
it's because people can "get in serious trouble" for looking at things that
they don't have clearance for so blocking makes it much less likely there'll
be a mistake when sites that host that content are all over the news.

~~~
astura
Leaking classified material doesn't make it unclassified, leaked documents are
still classified material and must be treated as such if you have a security
clearance - but only if you had/have a clearance. That means not only not
accessing it if you don't have the proper clearance and need-to-know, but also
never transmitting it over unclassified channels or downloading or storing it
onto an unclassified computer.

Since it's never authorized to access classified information from an
unclassified network, it makes sense to block access to classified materials
from unclassified networks.

------
andrewstellman
Matt Tait (infosec researcher at the Strauss Center at UT Austin) had some
insight into this:
[https://twitter.com/pwnallthethings/status/96752319618133606...](https://twitter.com/pwnallthethings/status/967523196181336064)
– it's worth reading his first few comments, he has some good thoughts on
using fixed-width fonts in redacted documents.

------
eccbits
While this is fun technologically, is there an obligation for the author to
think of the negative impact? Certainly this isn't helping the US.

------
salgernon
So, given the very small list of possible names associated with the subject of
this paragraph, I should think the number of permutations would be in the low
hundreds. I’d be willing to bet the larger block falls to this analysis by
morning in Washington. (Not that that is necessarily the most important block,
but it probably has the bounded scope.)

------
skosch
_I am working on automating this process and trying to find similar hits for
the remaining redactions._

I was thinking that too, but it characterizes Carter Page as a "former
campaign foreign policy advisor". The other names may have similar
designations, making that a hopeless game.

We do know, from an unredacted footnote, that one of them was Michael Flynn.
So that makes 2/4.

~~~
mattnewton
No points for the others either: Manafort and Gates have also been indicted
publicly, and names and titles for each probably fit in the redacted
section.

------
Smushman
Long thought this should be possible... I am very impressed you figured out a
way to do it.

You could iterate the strategy further by creating a full font from the
existing text, identifying spacing etc. for each next letter/character,
calculating word-wrap lengths, and get it even closer.

~~~
pacaro
It’s probably most effective to identify the software and font that produced
the original. Exact positioning of text is complex: you’d need to identify
kerning pairs, spacing rules within lines, sub-pixel positioning (where
appropriate), etc.

------
gravypod
I wonder if someone could train a DNN to uncensor images of similarly redacted
information

------
kw71
Why can't this be presented without requiring JavaScript???

~~~
jwilk
Here's a copy that works without JS:

[https://gist.github.com/anonymous/37200394d975e1a86ee9d19bef...](https://gist.github.com/anonymous/37200394d975e1a86ee9d19bef0cf2e9)

------
colinbartlett
Could you use some kind of kerning randomizer to prevent this?

------
Angostura
The joke's on him. The number is actually numeric, and I work it out to be
8423

~~~
StavrosK
You can see the right tip of the "r" past the redacted part.

------
ohiovr
How high of a resolution were these scans?

------
paulcole
Blacking out text with a sharpie isn’t tradecraft.

~~~
timb07
My understanding is that the text should be cut out with scissors instead.

------
tzahola
Can we use this approach to recover the missing 18 minutes of the Watergate
tapes?

