Mark Felt-Tipped: Uncovering top-secret information by counting pixels (matthi.coffee)
369 points by matt4077 on Feb 25, 2018 | hide | past | favorite | 126 comments


The title of the post is a pun: Watergate's "Deep Throat" was a man named Mark Felt, who was the Associate Director of the FBI. He felt jilted over his lack of promotion to the Directorship, and responded by leaking to Woodward and Bernstein. https://www.vanityfair.com/news/2013/11/watergate-leak-mark-...


Disappointed the Marker Felt font didn't make a cameo.


It's no different for encrypted data that obscures the content but not its size. I've written before about how you can sometimes infer the size of a password in an HTML form submitted over an otherwise fully secure, correctly implemented SSL (HTTPS) connection. I've also shown how you can guess the general location on earth encoded in GPS coordinates from the size of the textual representation of those coordinates alone. It's not rocket science to make this deduction, but it is, I think, nonetheless an under-appreciated aspect of consumer-grade encryption. See here if you want to read my findings: https://guidovranken.files.wordpress.com/2015/12/https-bicyc...
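To make the point concrete, here is a toy sketch (all numbers are illustrative, not taken from the comment) of why ciphertext length leaks plaintext length under a cipher with fixed per-record overhead, such as TLS 1.2 AES-GCM:

```python
# Sketch: inferring plaintext length from observed TLS record sizes.
# Assumes a non-padding AEAD cipher with fixed per-record overhead;
# the numbers below are illustrative.

RECORD_HEADER = 5      # TLS record header bytes
EXPLICIT_NONCE = 8     # per-record nonce (TLS 1.2 AES-GCM)
AUTH_TAG = 16          # GCM authentication tag

def plaintext_len(observed_record_len: int) -> int:
    """Exact plaintext length hidden inside one observed record."""
    return observed_record_len - RECORD_HEADER - EXPLICIT_NONCE - AUTH_TAG

# Two captured login POSTs that differ only in the password field
# differ by exactly the difference in password length:
len_a = plaintext_len(512)
len_b = plaintext_len(509)
assert len_a - len_b == 3    # leaks a 3-character length difference
```

A block cipher in CBC mode would only leak the length up to block granularity, which is still enough to bound the search space.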


> I've written before about how you can sometimes infer the size of a password in a HTML form submitted over an otherwise fully secure, correctly implemented SSL (HTTPS) connection.

Good catch. Interesting, but obvious once you think about it. Not really a danger for properly chosen passwords/passphrases (the search space would still be too large for brute forcing), but it would highlight where a brute-force attempt is worth trying.

I might have to add an extra (hidden) input to all our authentication pages, programmatically filled with random characters up to a length of 1024 minus the length of the entered password.
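A minimal sketch of that idea; the `padding_for` helper and the target length are assumptions, not an existing API:

```python
import secrets
import string

# Hypothetical sketch: fill a hidden form field with random characters
# so that password + padding always sums to a constant length.
TARGET_LEN = 1024

def padding_for(password: str) -> str:
    """Random filler so the combined submitted length is constant."""
    n = TARGET_LEN - len(password)
    return "".join(secrets.choice(string.ascii_letters) for _ in range(n))

pad = padding_for("hunter2")
assert len("hunter2") + len(pad) == TARGET_LEN
```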


For extra paranoia, pad to a byte limit so that UTF-8 doesn't leak anything from its fluctuating length.

IMO if you don't need the entropy, it's probably easier to pad with some kind of "clearly not legal" character which is still visible when debugging, such as newlines.


> IMO if you don't need the entropy

You almost certainly don't, but my default for any security related value is "properly random" except where the randomness itself might give clues. That way around is less likely to result in a "kicking oneself" situation in the future!


The reason I'm suggesting a detectable padding is for a different "kicking yourself" scenario: What happens if some system ever doesn't strip the junk? Now you've got corrupt data, and it's been corrupted in a way that you cannot reliably fix.

In the case of hashed passwords, suppose the user entered "foo" but you stored hash("foo42156"). Now the user is locked out, and there's no way for you to fix it on their next login attempt, because you have no way of knowing how much of it was "real" anymore.

In contrast, a deterministic system (like "pad with newlines to 256 bytes") allows you to take their next login attempt, validate it under the older method, and "upgrade" the hash to the correct de-junked version.

It's not just passwords either: The issue of corruption also applies to all variable-length non-hashed sensitive data you might apply this scheme to. For example, security-questions, e-mail addresses, financial account numbers, etc.
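A minimal sketch of the deterministic scheme described above (the constants are illustrative):

```python
# Pad with a character that's clearly illegal in a password (newline)
# to a fixed byte length, and strip it before hashing. If some system
# ever fails to strip, the junk is detectable and the stored value can
# be repaired on the user's next login.
PAD_LEN = 256
PAD_CHAR = "\n"

def pad(secret: str) -> str:
    assert PAD_CHAR not in secret and len(secret) <= PAD_LEN
    return secret + PAD_CHAR * (PAD_LEN - len(secret))

def unpad(padded: str) -> str:
    return padded.rstrip(PAD_CHAR)

assert len(pad("foo")) == PAD_LEN
assert unpad(pad("foo")) == "foo"
# Even a value that was accidentally stored padded can be de-junked:
assert unpad("foo" + "\n" * 253) == "foo"
```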


Valid point.

In this case though the extra padding would be in a separate field that only exists to control the length of the POST request body - nothing should be looking at it in a way that would allow it to corrupt other data.


Ah, I was interpreting it as randomness/padding on a per-field basis, e.g.:

password = "foo2556019562042" # Example limit of 16 chars

password_real_len = 3


The Washington Post actually ran with this redaction analysis, in "What we learned from the Democratic response to the Nunes memo — and what we didn’t" posted a couple hours ago:

Article: https://www.washingtonpost.com/news/politics/wp/2018/02/25/w...

Handy GIF: https://img.washingtonpost.com/pbox.php?url=https://www.wash...

"By September 2016, the FBI had opened investigations into four members of Trump’s campaign team. The Democratic memo says the information compiled by Steele into his infamous “dossier” of 17 raw intelligence reports didn’t get to the FBI’s counterintelligence team until the middle of September. By that point, we can conclude thanks to a sloppy redaction (noted by former intelligence officer Matt Tait) and an unredacted footnote that Page, Papadopoulos, former Trump campaign chairman Paul Manafort and Michael Flynn, who would go on to be Trump’s national security adviser, were all already under investigation."

Matt Tait links to: https://twitter.com/pwnallthethings/status/96752319618133606...


The FBI actually first interviewed Steele in July 2016 about 25 days before they opened the investigation.

“Simpson said Steele first shared his concerns with the FBI during the first week of July 2016 and in a subsequent meeting with the Rome official two months later when Steele provided the official ‘a full briefing’ of his findings”

https://www.usatoday.com/story/news/politics/2018/01/09/doss...


This would appear to have been refuted by the latest release:

https://www.politico.com/story/2018/02/24/democratic-memo-go...


"The dossier, compiled by former British spy Christopher Steele, wasn't provided to the FBI's counterintelligence team until mid-September 2016, according to the memo."

This statement could be perfectly true, while it's also perfectly true that Steele met with the FBI in July and had multiple other channels to provide information to various other FBI departments. It doesn't particularly matter exactly how the investigation started. If you've read the dossier, then knowing what we now know about how it came to be, it's pretty disgusting.

"FBI officials indicated that Steele himself was not advised that the work he was doing was on behalf of the Clinton campaign."

Now this is something I hadn't heard before! It's absolutely shocking that we're supposed to believe this ex-spy was in the dark about who was paying him.


IIRC the genesis of the dossier was opposition research by other Republicans. We're talking about intelligence operatives; they probably never actually know where the money is coming from, and it often flows from different and sometimes competing sources. These organizations run on secrecy and distrust.


He knew exactly who was paying him, he just may not have known who was paying those people.


> we can conclude thanks to a sloppy redaction

So the reasonable question that follows is: was the sloppiness intentional, or a leak?


Hanlon's Razor would seem to be a decent explanation.


"All remaining explanations being of equal probability, the stupidest one is the most likely explanation."


Here on Hacker News four years ago we figured out the redacted name of a country in a similar way. The Intercept[1] reported that the National Security Agency was secretly recording the audio of every phone call in the Bahamas and in one other unnamed country. Looking at the length of the country name in the source document, we figured out that it was Afghanistan[2] based on the length of the blacked-out area and the fact that it couldn't word-wrap to a second line.

[1] https://firstlook.org/theintercept/article/2014/05/19/data-p...

[2] https://news.ycombinator.com/item?id=7768839


Tom Murphy did the same thing yesterday (on the same part of the text no less), but he got five instead of four! [1]

[1] https://twitter.com/tom7/status/967568358861430785


I trust four more, because his alignment seems to be off in the beginning, and because the space that follows is too short for descriptions of 5 people, assuming that it follows the same format as Carter Page (association then full name).


Plus there is that little nub that sticks out on the right side of the redaction, where the end of the "r" in "four" fits in perfectly.


I was surprised that wasn't mentioned; in addition to spacing issues it looks like the redaction literally just stopped too soon.


Also a good time to mention “Van Eck Radiation”. This is something all CRT screens, and to a lesser extent, LCDs, emit. If you pick up this radiation and know the model of the screen being used, you essentially have access to a live visual of a person’s monitor.

Also worth mentioning that just like the Secret Service has an ink database on all the printer types in the world, the NSA is supposed to have a database of what different keyboards sound like. This means that simply by recording the sound of you typing, they can infer keystrokes / characters. Obviously the easiest way to record this is by hacking your phone, which is right next to you.

https://en.m.wikipedia.org/wiki/Van_Eck_phreaking


Actually, the exploit with the keyboards is both more interesting and more sophisticated than that. As described in Silence On The Wire, it essentially works because English letters are not randomly distributed: if you hear any given keystroke, the most likely letter to have been pressed, knowing nothing else, is 'e'. This on its own isn't very helpful, but you don't type every letter on your keyboard at the same speed. It takes you ever so slightly longer to type 'z' than 'f'. You of course also only type one key at a time; as a consequence, it's possible to merely follow a procedure something like this to recover English text given audio of it being typed on a standard keyboard:

- Assign a prior probability of letter frequencies based on a corpus of the language text you're analyzing.

- Separate the different keystrokes in the audio file into a series of times between keystrokes. (i.e, have your program recognize one keystroke as distinct from another)

- Based on the subtle timing differences between keystrokes, assign various lengths to different timings.

- Using the prior probability of the letter frequencies, assign the different time lengths to different characters based on their frequencies.

- You now have a straightforward mapping between the distance between two keystrokes and the character typed, which should allow you to decode the typed text.

Further calibration can probably be had by considering a word dictionary and using fuzzy matching to detect how often words are decoded incorrectly and what the correct decoding would be.
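A toy illustration of the frequency-matching step in the procedure above, not the real statistical attack: rank the distinct inter-keystroke timings by how often they occur, then pair them with letters ranked by corpus frequency. All numbers here are fabricated.

```python
from collections import Counter

# Common English letters, most frequent first (standard corpus ranking).
CORPUS_FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def decode(intervals_ms):
    """Map each observed timing to a letter by matching frequency ranks."""
    by_freq = [t for t, _ in Counter(intervals_ms).most_common()]
    mapping = dict(zip(by_freq, CORPUS_FREQ_ORDER))
    return "".join(mapping[t] for t in intervals_ms)

# Fabricated capture where 120ms is the most common gap, so every
# 120ms keystroke decodes to 'e':
decoded = decode([120, 95, 120, 130, 120])
assert decoded[0] == "e" and decoded[2] == "e" and decoded[4] == "e"
```

A real attack would work over probability distributions rather than a one-to-one mapping, and would apply the dictionary-based fuzzy correction mentioned above as a second pass.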


That's absolutely fascinating, not to mention a stronger attack. I use a self-designed and manufactured keyboard, so I'd definitely be immune to a database of various keyboard designs--mine is one-of-a-kind. But I doubt that the differences in key layout would be sufficient to thwart the timing attack.

Of course, if you know or suspect that you are under such surveillance, you could try and alter your typing cadence, e.g. by switching to hunt-and-peck.


Perhaps some good music is the perfect defense. It would obscure the sound of the keyboard, and also make it easy to type with the beat. :)


Or switch to alternate keyboard layouts that have dissimilar key distances & locations. Using Dvorak/Colemak etc would throw this method off unless it managed to retrain for the particular differences due to layouts.

Maybe a random layout generator that would mask this effect and only cost typing speed for the extra defense?


Using this methodology more generally, it would be interesting to use NLP to identify the part of speech that is redacted to narrow the word search space.

In this example, the POS would be an adjective, and since the subject noun is plural, it's more likely the adjective is a number.
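As a sketch of how POS filtering could narrow the candidate set before any pixel-width matching (the tiny lexicon here is made up; a real attempt would use an NLP tagger over a large corpus):

```python
# Hypothetical mini-lexicon mapping candidate words to parts of speech.
LEXICON = {
    "four": "NUM", "five": "NUM", "nine": "NUM",
    "some": "DET", "many": "ADJ", "blue": "ADJ",
}

def candidates(words, wanted_pos):
    """Keep only candidates whose part of speech fits the sentence slot."""
    return [w for w in words if LEXICON.get(w) == wanted_pos]

# "...opened investigations into [????] members..." wants a number:
assert candidates(LEXICON, "NUM") == ["four", "five", "nine"]
```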


You might get something out of predicting the first and last words and word type of a redaction (based on the words next to the redaction), but it's only cutting down on the brute-force space.

As a general rule, that makes short redactions more dangerous for declassification than entire paragraphs, because with a whole paragraph redacted you have no context to start from; but that's probably common sense to people doing redactions.


Or better yet, rank the resulting sentences for each of the possible fits using existing speech API, setting a cutoff to filter out nonsensical results. This might even yield surprises.


You can't really predict if there's going to be an aside in the redacted sentence though.


[flagged]


NLP: Natural language processing. Part of speech (POS): the elements that, when combined, form a sentence (adjectives, verbs, nouns, etc.).

What the OP suggests is to run existing tools to identify what POS are missing (as in "in this blank space you could only have a noun and two to three adjectives"), and use that to reduce the search space.


Consider the opposite problem: What if you want to create a document that retains redacted information?

Some kind of Reed–Solomon type encoding in the typography that would allow retrieving the whole document even after it has been redacted and copied.


This was the goal of The Underhanded C Contest in 2008.

http://www.underhanded-c.org/_page_id_17.html


Random thought: using serifs to encode additional bits with a Reed-Solomon code would allow for, say, one bit per letter (either we use the serif or the sans-serif version). Assuming Latin characters, it would allow for t/2 = 1/16 (6.25%) of the original text - a bit lower if we consider that letters like "o" do not have serifs, somewhat higher if we remove easily-guessable words like "the", "an" and so on.

That's a good idea, I might do a quick Show HN about that.
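A minimal sketch of what such a serif-bit encoder might look like, with the glyph style tracked as a plain label rather than an actual typeface switch (all names here are hypothetical, and the Reed-Solomon layer is omitted):

```python
def embed(cover: str, bits: str) -> list:
    """Attach one payload bit per letter via a serif/sans style choice."""
    out, i = [], 0
    for ch in cover:
        if ch.isalpha() and i < len(bits):
            out.append((ch, "serif" if bits[i] == "1" else "sans"))
            i += 1
        else:
            out.append((ch, "sans"))  # spaces and leftover letters
    return out

def extract(styled) -> str:
    """Recover the bitstring from the per-letter style labels."""
    return "".join("1" if style == "serif" else "0"
                   for ch, style in styled if ch.isalpha())

styled = embed("redact me", "1011010")
assert extract(styled).startswith("1011010")
```

In practice you'd wrap the payload in an error-correcting code first, so the message survives even if the redaction bar destroys some letters.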


If that's the case, I think the easiest thing to do would be to transcribe the document and include annotations inline about where content was omitted.

e.g. Go from "XXXX people connected with.." to "[REDACTED] people connected with". So long as it's noted that redaction happened and one doesn't have the source document to glean things from, that would probably be sufficient to block the things the author of this article is doing.

What I'm wondering, though, is if this is permissible. Being able to see how much of a document was redacted gives us some important context:

1. We know we're looking at the "original" report. This might not be the case with an approved transcription of some sort.

2. It's easy for a layman to get a sense of how much information is being withheld.

Knowing what and how much was redacted seems like it's pretty important for maintaining trust and transparency with an organization that has legitimate reasons to withhold some amount of information.


At that point just leak it anonymously somehow. If you do something this cute everyone will know the person doing the redacting effectively leaked it intentionally. Why not just take a picture, strip the metadata, and pass it to a trusted journalist?


You could use steganography to store the redacted parts.


Wouldn't that change the non-redacted parts of the document? Or would there be a cover page with a story about a boy and his dog on every FOIA request?


Note to self: print all documents in mono-spaced Courier New if I intend to ever redact any contents.


It seems like instead you need to manually replace all redacted text with a fixed-width placeholder like `[REDACTED]`


        Due          to         new regulations         all   documents        must          be      padded           .


If it's not a random amount per word it can still be guessed though, and even then it won't work unless they use a monospace font.


But then you'd be giving away the exact length of the redacted text. In this case, you'd be narrowing the options down to "four", "five", and "nine" - reasonable, but longer text blocks would be more susceptible to analysis.


That's a good point. What about random fonts per character? Sucks when you get a wingding, but hey.


> sucks when you get a wingding, but hey

redaction by wingding could be a feature


That seems like it would make text incredibly difficult for a human to read.


Yes, you’d have to replace the blacked out text with a fixed length string too. "REDACTED" perhaps.


What about random noise in whitespace length?


Monospace fonts only reduce this channel. If you want to redact properly, you'd need to replace every redacted section with a fixed-length placeholder.


Monospace tells you precisely how many characters were redacted. A proportional font with random kerning is better, but typographers would probably murder you in your sleep.


Mixed fonts...


Clip out letters from magazines and paste them onto craft paper. Plenty of precedent.

"iF YOu wAnT tO SEe yOUR deMOCRacY aGAIn lEAvE 5 miLLioN dOLlArs uNDer THE bRidGe AT ..."


That always seemed bizarre to me. I get that printers, typewriters, and handwriting can all be identified, but why not just grab a common bic pen and hold it in your fist like a four year old, then print? I don't see how that could be matched to anything, and surely there's less room to screw up than when you're pasting out of a bunch of magazines and newspapers.


Agreed. I always figured this was a meme that made it easier for the police to track you down (just do some forensics on the magazine paper, fonts, etc., and locate someone with matching subscriptions).

"Hey boss, we got a subscriber to Mad Bomber Monthly, Torture World, and BYTE."

(flipFlipFLip) "One of these three guys?"


If I drank milk, it would have shot out of my nose.


Please take money!

We will throw in another 5 if you also take your closet-case VP with you!


You also need to do padding.


In the example shown, it's obvious it's four, since you can see the right edge of the letter sticking out past the black mark.


Or n6, nn6, n7 or nn7.


Looks like a job for dynamic programming!

It's a classic packing problem - searching for phrases of exact pixel width, where each of the unknown number of letters contributes a different number of pixels.
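A brute-force sketch of that packing search, with made-up per-word pixel widths; a real version would measure glyph widths from the document's font and memoize subproblems:

```python
# Enumerate phrases from a small candidate dictionary whose summed
# rendered widths (plus inter-word spaces) exactly hit the measured
# redaction width. All widths below are invented for illustration.
WIDTHS = {"four": 34, "five": 29, "nine": 33, "twelve": 48, "several": 52}
SPACE = 7  # hypothetical width of a space in the same font

def phrases(target, prefix=()):
    """Yield every word sequence whose total width is exactly `target`."""
    if target == 0:
        yield prefix
    for word, w in WIDTHS.items():
        cost = w + (SPACE if prefix else 0)
        if cost <= target:
            yield from phrases(target - cost, prefix + (word,))

matches = [" ".join(p) for p in phrases(34)]
assert "four" in matches
```

With real metrics the target is a pixel range rather than an exact value, and kerning across word boundaries has to be folded into the cost.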


Is it even tractable?


Footnote seven in the next paragraph lists the four people anyway and also un-redacts that redaction.


Any sense of the redacted word in that next paragraph?


Tough to say, but the most likely answer is that it's either "four" with extra space redacted, or the word "several", or maybe "seven"; that would be 3 more, and it's not a stretch to imagine there are 3 more in this administration who would have been under investigation.


This reminds me of the New York times redaction fail: https://www.techdirt.com/articles/20140128/08542126021/new-y...


Someone else tried that (posted further down) https://twitter.com/tom7/status/967568358861430785

I think it is also unlikely given the space that follows, which would read more naturally if it was name + title; but it could be, if the alignment was just off too much. (I think the article here does a better job aligning than this guy did, judging from the beginning of the sentence.)


If it is felt pen-redacted and then scanned, wouldn't there be some detectable difference between the parts of the paper with and without underlying ink?


Yes, if you have the original. However, the scans usually adjust the contrast so much that the saved image is not grayscale or colour; it's either black or white.


Why a numeral exactly, and not an undefined quantity like "some"? Is there additional context to eliminate the possibility of imprecise quantifiers?


Here's what "four" and "some" look like: https://i.imgur.com/UNSpv6v.png . Different letter widths and kerning make this identification possible.

I wonder how brute-forceable it is. Given a blank with a length of half a line, can I just throw 20-30 combinations of the alphabet and punctuation at it, figure out which combinations match, and then solve the anagram?

Actually, brute-forcing using words from a dictionary (English words plus names of the involved people (Americans, and obviously Russians)) would make it go even faster.


In theory kerning differences would make letter ordering significant, so this becomes a permutation problem not a combination problem, and therefore much harder to brute force.


But valid sentence fragments are not random draws from an alphabet, they're drawn from a dictionary, which makes it much easier.


Is kerning used also across spaces? I.e. is the space between two words dependent on which letter the first word ends with, and which letter the second word starts with?

If not, it still seems tractable using dictionary words.


> Is kerning used also across spaces?

Yes.


Ok. That makes it much more difficult. Still, assuming grammatically valid sequences of dictionary words, kerning would be known both within and between words. These texts however probably contain lots of abbreviations, footnote symbols, numbers, brackets, etc., which likely make them a lot harder than just regular prose from dictionary words.


Did you mean this becomes a combination problem not a permutation problem?


It's been a while since my last discrete math class so take this with a grain of salt, but in a combination you only care about selecting the correct members of a set, so order does not matter. In a permutation problem you care about selecting the correct members as well as their order, so it is a significantly larger solution space, which expands factorially.


Thanks for pointing that out. So if they are to leave the original layout (not padding or subtracting spacing and so forth), then they should at a minimum use monospaced typefaces.


It stands to reason that a precise quantifier would be much more likely to be considered sensitive enough to redact than an imprecise one.


Why would they redact an undefined quantity?


Exactly, these aren't randomly selected words for redaction. The fact that someone thought the information was worth hiding gives a lot of information regarding its possible contents.


Just FYI, the government became aware of this issue around 2010 and changed the policy on how redactions were done, moving from blackout to a white box with black edges, overlaid on the text with small size variations to prevent the attack. Old documents that have already been released, though, are likely to be ripe for this kind of analysis.


This one was released over the weekend, though, so apparently somebody didn't get the memo.


I'm guessing Congressional memo redactions, especially these days, aren't quite up to the standards of intelligence agencies that have to redact and release for FOIA on a daily basis.


Why were digits ruled out? Like "100"?


A two-digit number (i.e. "13") would be too short for the redaction. A three- or four-digit number is highly unlikely (because it would be treated as a single conspiracy, not as separate investigations of individuals). Plus, the sentence that follows after the colon seems to be a list of the individuals, and its length points at a single-digit number.


Maybe there's some standard style guide that rules that out?

I would be interested in seeing how many "matches" digits would yield


I don't know if it's an official style somewhere or not, but I was always taught 1-9 you spell out (one, two, three,...) And 10+ you write digits (10, 11, 12, 13,...). With a few exceptions, like if two numbers are next to each other.


Interesting, but operating systems & graphics applications conform typefaces to different subpixel grids. 17pt Times in Pixelmator on a Mac is not 17pt Times on Photoshop on Windows, and certainly not 17pt Times on a scanned document printed on an undisclosed substrate by an unknown printer. Similarly, typeface tracking adds a significant margin of error to this interpretation method. Metrics can vary by application and some applications will automatically provide optical tracking adjustment for a better aesthetic. It's for these reasons that I'm convinced that anything other than trivial (in terms of LOE) & obvious disclosures won't be made using this methodology, and we can't universally trust the results.


They achieved a 100% match on the unredacted text with Word. Empirically speaking, they've proven their software generates the same output as the FBI's.


Seems like if you truly care about hiding the redacted information, the only choice is to manually retype the document with the redacted text replaced with something that has no information content, like the literal text "REDACTED."

Why don’t they do this?


Probably to maintain a sense of the boundedness of what's withheld.

How big is the secret they're hiding? Is it of unlimited or limited length? Your technique would obscure the sense of scale.


You could add multiple instances of "REDACTED" to show approximately how much stuff is missing, if that's important.


I wonder whether or not there should be a rule on HN about linking to classified information. I am an extremely harsh critic of the American security state and the IC in general, but lower-level employees can be fired or even charged with crimes for viewing things they're not supposed to, and I don't think it's right to disregard the effect on the individual. Obviously, in this case, it's only a single word, but I've seen leaked docs here before.


Nothing has been leaked here. The source document has been publicly released.

Insufficient care has been taken in redacting parts of the document. Responsibility for that falls to whoever redacted the document, not to whoever was able to deduce the redacted content.


Ah. I wasn't aware that the source document had already been released. But whether or not someone could get in trouble for viewing an un-redacted doc is an open question afaict. The quality of the redaction probably doesn't matter.


I agree with you regarding this in general, maybe a tag would be useful, but in this case the president has declassified the source document. The attempts to guess what was redacted would not count as classified information.


Absolutely. This wasn't the best hill to plant this flag on but it's still something I think the community should be aware of. It was more applicable when people were linking directly to wikileaks dumps for thousands of indexed, TS docs.


If you’re extremely harsh on the IC in general, then you wouldn’t be saying this. Those people chose to get those clearances, and voluntarily agreed to ridiculous and insane terms like “reading things other people say I can’t is now grounds to put me in jail”.

Such people do not deserve special workarounds by the rest of society because of their terribly poor judgement.


Nah, I just differentiate between the thing and its components. Would we see it as some great victory towards disarming the surveillance state if someone clicked on a link and happened to get caught for it? That's a very "spray'n'pray" sort of approach, which I think could do more damage than it's worth.


Reading classified things will not be grounds to put anyone in jail. Not sure where you got that.


I have some old friends that work in the American intelligence community (CIA, NRO, etc.) and they've explained to me that you consent to a lot of things when you join the military and get clearance that don't apply to civilians. I asked why the Army/Navy completely block access to WikiLeaks, and I was told it's because people can "get in serious trouble" for looking at things that they don't have clearance for, so blocking makes it much less likely there'll be a mistake when sites that host that content are all over the news.


Leaking classified material doesn't make it unclassified; leaked documents are still classified material and must be treated as such if you have a security clearance - but only if you had/have a clearance. That means not only not accessing it if you don't have the proper clearance and need-to-know, but also never transmitting it over unclassified channels or downloading or storing it onto an unclassified computer.

Since it's never authorized to access classified information from an unclassified network, it makes sense to block access to classified materials from unclassified networks.


I understand your concerns, but we can't not publicize the information, that would violate the public interest. These leaks have little effect if they are not broadly distributed.

However, out of concern for individual workers in the bureaucracy, I would support a tag on such stories to warn them before clicking.

EDIT: On reading the other comments here, I don't think even a tag is warranted for this story as no leaking has occurred.


I guess a "NSFW" tag technically fits - it's not something 100% safe to view at work.


I would prefer something like (LEAK) or (CLASSIFIED LEAK). NSFW means that many ordinary people that aren't subject to the logical contradictions of the classification system won't look because they think it's porn or something.


NSFW isn't enough, if you possess a security clearance it's not safe to view at home either as 1) you can't access classified information if you don't have a need-to-know and 2) you aren't authorized to access classified information from an unclassified computer.


Matt Tait (infosec researcher at the Strauss Center at UT Austin) had some insight into this: https://twitter.com/pwnallthethings/status/96752319618133606... - it's worth reading his first few comments; he has some good thoughts on using fixed-width fonts in redacted documents.


While this is fun technologically, is there an obligation for the author to think of the negative impact? Certainly this isn't helping the US.


So, given the very small list of possible names associated with the subject of this paragraph, I should think the number of permutations would be in the low hundreds. I'd be willing to bet the larger block falls to this analysis by morning in Washington. (Not that it is necessarily the most important block, but it probably has the most bounded scope.)


I am working on automating this process and trying to find similar hits for the remaining redactions.

I was thinking that too, but it characterizes Carter Page as a "former campaign foreign policy advisor". The other names may have similar designations, making that a hopeless game.

We do know, from an unredacted footnote, that one of them was Michael Flynn. So that makes 2/4.


No points for the others either: Manafort and Gates have also been indicted publicly, and names and titles for each probably fit in the redacted section.


Long thought this should be possible... I am very impressed you figured out a way to do it.

You could iterate the strategy further by creating a full font from the existing text, identifying spacing etc. for each next letter/character, calculating for word-wrap lengths, and getting it even closer.


It's probably most effective to identify the software and font that produced the original; exact positioning of text is complex: you'd need to identify kerning pairs, spacing rules within lines, sub-pixel positioning (where appropriate), etc.
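For illustration, kerning-aware width computation might look like this (all metrics here are invented, not real font data):

```python
# Rendered width = sum of per-glyph advances, plus a kerning adjustment
# for each adjacent glyph pair. Both tables below are hypothetical.
ADVANCE = {"f": 10, "o": 18, "u": 18, "r": 12, " ": 9}
KERNING = {("f", "o"): -2, ("u", "r"): -1}  # made-up kerning pairs

def text_width(s: str) -> int:
    """Width of a string under the toy metrics above."""
    w = sum(ADVANCE[c] for c in s)
    w += sum(KERNING.get(pair, 0) for pair in zip(s, s[1:]))
    return w

assert text_width("four") == 10 + 18 + 18 + 12 - 2 - 1
```

Real layout engines add further wrinkles (hinting, sub-pixel rounding, optical tracking), which is why identifying the exact software matters.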


I wonder if someone could train a DNN to uncensor images of similarly redacted information.


Why can't this be presented without requiring JavaScript???



Could you use some kind of kerning randomizer to prevent this?


The joke's on him. The number is actually numeric, and I work it out to be 8423


You can see the right tip of the "r" past the redacted part.


How high of a resolution were these scans?


Blacking out text with a sharpie isn’t tradecraft.


My understanding is that the text should be cut out with scissors instead.


Can we use this approach to recover the missing 18 minutes of the Watergate tapes?



