The title of the post is a pun: Watergate's "Deep Throat" was a man named Mark Felt, who was the Associate Director of the FBI. He felt jilted over his lack of promotion to the Directorship, and responded by leaking to Woodward and Bernstein. https://www.vanityfair.com/news/2013/11/watergate-leak-mark-...
It's not different for encrypted data that obscures the content but not its size. I've written before about how you can sometimes infer the size of a password in an HTML form submitted over an otherwise fully secure, correctly implemented SSL (HTTPS) connection. I've also shown how you can guess the general location on the earth encoded in GPS coordinates from the size of the textual representation of those coordinates alone. It's not rocket science to make this deduction, but it is, I think, nonetheless an under-appreciated aspect of consumer-grade encryption. See here if you want to read my findings: https://guidovranken.files.wordpress.com/2015/12/https-bicyc...
> I've written before about how you can sometimes infer the size of a password in a HTML form submitted over an otherwise fully secure, correctly implemented SSL (HTTPS) connection.
Good catch. Interesting, but obvious once you think about it. Not really a danger for properly chosen passwords/passphrases (the search space would still be too large for brute forcing), but it would highlight where a brute force attempt is worth trying.
I might have to add an extra (hidden) input to all our authentication pages, programmatically filled with random characters up to a length of 1024 minus the length of the entered password.
For extra paranoia, pad to a byte limit so that UTF-8 doesn't leak anything from its fluctuating length.
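As a rough sketch of those two suggestions combined (the 1024-byte target and the hidden-field approach come from the comments above; the function name and the use of `secrets` are my own illustration):

```python
import secrets
import string

PAD_LIMIT = 1024  # fixed target size in bytes, per the suggestion above

def padding_for(password: str) -> str:
    """Random ASCII filler for a hidden input, sized so that the password
    plus the filler always total PAD_LIMIT bytes of UTF-8, so neither the
    character count nor multi-byte characters leak the password's length."""
    used = len(password.encode("utf-8"))
    need = max(PAD_LIMIT - used, 0)
    return "".join(secrets.choice(string.ascii_letters) for _ in range(need))
```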
IMO if you don't need the entropy, it's probably easier to pad with some kind of "clearly not legal" character which is still visible when debugging, such as newlines.
You almost certainly don't, but my default for any security related value is "properly random" except where the randomness itself might give clues. That way around is less likely to result in a "kicking oneself" situation in the future!
The reason I'm suggesting a detectable padding is for a different "kicking yourself" scenario: What happens if some system ever doesn't strip the junk? Now you've got corrupt data, and it's been corrupted in a way that you cannot reliably fix.
In the case of hashed passwords, suppose the user entered "foo" but you stored hash("foo42156"). Now the user is locked out, and there's no way for you to fix it on their next login attempt, because you have no way of knowing how much of it was "real" anymore.
In contrast, a deterministic system (like "pad with newlines to 256 bytes") allows you to take their next login attempt, validate it under the older method, and "upgrade" the hash to the correct de-junked version.
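A minimal sketch of that upgrade path, assuming the deterministic newline padding described above; SHA-256 stands in for a real password KDF purely to keep the example short:

```python
import hashlib

PAD_TO = 256  # deterministic pad target from the comment above

def _pad(pw: str) -> str:
    return pw.ljust(PAD_TO, "\n")  # pad with newlines, as suggested

def _hash(s: str) -> str:
    # Stand-in only; use bcrypt/scrypt/argon2 for real password storage.
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

def verify_and_upgrade(attempt: str, stored: str):
    """Accept logins whose stored hash was accidentally computed over the
    padded value, and return the corrected ("de-junked") hash to store."""
    if _hash(attempt) == stored:
        return True, stored
    if _hash(_pad(attempt)) == stored:  # legacy: padding wasn't stripped
        return True, _hash(attempt)     # upgrade to the correct hash
    return False, stored
```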
It's not just passwords either: The issue of corruption also applies to all variable-length non-hashed sensitive data you might apply this scheme to. For example, security-questions, e-mail addresses, financial account numbers, etc.
In this case though, the extra padding would be in a separate field that only exists to control the length of the POST request body - nothing should be looking at it in a way that would allow it to corrupt other data.
The Washington Post actually ran with this redaction analysis, in "What we learned from the Democratic response to the Nunes memo — and what we didn’t" posted a couple hours ago:
"By September 2016, the FBI had opened investigations into four members of Trump’s campaign team. The Democratic memo says the information compiled by Steele into his infamous “dossier” of 17 raw intelligence reports didn’t get to the FBI’s counterintelligence team until the middle of September. By that point, we can conclude thanks to a sloppy redaction (noted by former intelligence officer Matt Tait) and an unredacted footnote that Page, Papadopoulos, former Trump campaign chairman Paul Manafort and Michael Flynn, who would go on to be Trump’s national security adviser, were all already under investigation."
The FBI actually first interviewed Steele in July 2016 about 25 days before they opened the investigation.
“Simpson said Steele first shared his concerns with the FBI during the first week of July 2016 and in a subsequent meeting with the Rome official two months later when Steele provided the official ‘a full briefing’ of his findings”
"The dossier, compiled by former British spy Christopher Steele, wasn't provided to the FBI's counterintelligence team until mid-September 2016, according to the memo."
This statement could be perfectly true while it's also perfectly true that Steele met with the FBI in July and had multiple other channels for providing information to various other FBI departments. It doesn't particularly matter exactly how the investigation started. If you've read the dossier, then knowing what we now know about how it came to be, it's pretty disgusting.
"FBI officials indicated that Steele himself was not advised that the work he was doing was on behalf of the Clinton campaign."
Now this is something I hadn't heard before! It's absolutely shocking that we're supposed to believe this ex-spy was in the dark about who was paying him.
IIRC the genesis of the dossier was opposition research by other Republicans. We’re talking about intelligence operatives; they probably never actually know where the money is coming from, and it often flows from different and sometimes competing sources. These organizations run on secrecy and distrust.
Here on Hacker News four years ago we figured out the redacted name of a country in a similar way. The Intercept[1] reported that the National Security Agency was secretly recording the audio of every phone call in the Bahamas and in one other unnamed country. Looking at the length of the country name in the source document, we figured out that it was Afghanistan[2], based on the length of the blacked-out area and the fact that it couldn't word-wrap to a second line.
I trust "four" more, because his alignment seems to be off at the beginning, and because the space that follows is too short for descriptions of five people, assuming it follows the same format as Carter Page (association, then full name).
Also a good time to mention “Van Eck Radiation”. This is something all CRT screens, and to a lesser extent, LCDs, emit. If you pick up this radiation and know the model of the screen being used, you essentially have access to a live visual of a person’s monitor.
Also worth mentioning that just like the Secret Service has an ink database on all the printer types in the world, the NSA is supposed to have a database of what different keyboards sound like. This means that simply by recording the sound of you typing, they can infer keystrokes / characters. Obviously the easiest way to record this is by hacking your phone, which is right next to you.
Actually, the exploit with the keyboards is both more interesting and more sophisticated than that. As described in Silence on the Wire, it works because English letters are not randomly distributed: if you hear any given keystroke, the most likely letter to have been pressed, knowing nothing else, is 'e'. This on its own isn't very helpful, but you don't type every letter on your keyboard at the same speed; it takes you ever so slightly longer to type 'z' than 'f'. You of course also only type one key at a time. As a consequence, it's possible to follow a procedure something like this to recover English text, given audio of it being typed on a standard keyboard (a toy sketch follows the list below):
- Assign a prior probability of letter frequencies based on a corpus of the language text you're analyzing.
- Separate the different keystrokes in the audio file into a series of times between keystrokes. (I.e., have your program recognize one keystroke as distinct from another.)
- Based on the subtle timing differences between keystrokes, assign various lengths to different timings.
- Using the prior probability of the letter frequencies, assign the different time lengths to different characters based on their frequencies.
- You now have a straightforward mapping between the distance between two keystrokes and the character typed, which should allow you to decode the typed text.
Further calibration can probably be had by considering a word dictionary and using fuzzy matching to detect how often words are decoded incorrectly and what the correct decoding would be.
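Here's a toy sketch of the rank-matching step (grossly simplified: it assumes a clean one-to-one mapping from quantized inter-keystroke timings to letters, which a real attack wouldn't):

```python
from collections import Counter

# Illustrative English letter-frequency order (most to least common).
LETTER_FREQ_ORDER = "etaoinshrdlcumwfgypbvkjxqz"

def decode_from_timings(intervals):
    """Toy rank-matching: assume each distinct inter-keystroke timing 'bin'
    corresponds to one key, and that the most frequent timing bin is the
    most frequent English letter, and so on down both rankings."""
    bins = [round(t, 2) for t in intervals]   # quantize near-equal timings
    ranked = [b for b, _ in Counter(bins).most_common()]
    mapping = {b: LETTER_FREQ_ORDER[i] for i, b in enumerate(ranked[:26])}
    return "".join(mapping.get(b, "?") for b in bins)
```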
That's absolutely fascinating, not to mention a stronger attack. I use a self-designed and manufactured keyboard, so I'd definitely be immune to a database of various keyboard designs--mine is one-of-a-kind. But I doubt that the differences in key layout would be sufficient to thwart the timing attack.
Of course, if you know or suspect that you are under such surveillance, you could try and alter your typing cadence, e.g. by switching to hunt-and-peck.
Or switch to alternate keyboard layouts that have dissimilar key distances & locations. Using Dvorak/Colemak etc would throw this method off unless it managed to retrain for the particular differences due to layouts.
Maybe a random layout generator that would mask this effect and only cost typing speed for the extra defense?
Using this methodology more generally, it would be interesting to use NLP to identify the part of speech that is redacted to narrow the word search space.
In this example, the POS would be an adjective, and since the subject noun is plural, the adjective is more likely to be a number.
You might get some mileage out of predicting the first and last words, and the word type, of a redaction (based on the words next to the redaction), but it's only cutting down on the brute-force space.
As a general rule, that makes short redactions more dangerous for declassification than entire paragraphs, since a fully redacted paragraph leaves no surrounding context to start from; but that's probably common sense to people doing redactions.
Or better yet, rank the resulting sentences for each of the possible fits using an existing language-processing API, setting a cutoff to filter out nonsensical results. This might even yield surprises.
NLP: Natural language processing. Part of speech (POS): the elements that, when combined, form a sentence (adjectives, verb, substantive, etc).
What the OP suggests is to run existing tools to identify what POS are missing (as in "in this blank space you could only have a substantive and two to three adjectives"), and use that to reduce the search space.
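A rough sketch of that filtering step (assumes NLTK with its `punkt` and tagger data installed; the `[REDACTED]` placeholder and the candidate list are hypothetical):

```python
import nltk  # assumes punkt and averaged_perceptron_tagger data are downloaded

def plausible_fills(template, candidates, expected_tag="CD"):
    """Keep only candidates whose part-of-speech tag in context matches the
    expected tag (e.g. 'CD' for a number, 'JJ' for an adjective)."""
    keep = []
    for word in candidates:
        tokens = nltk.word_tokenize(template.replace("[REDACTED]", word))
        tags = dict(nltk.pos_tag(tokens))
        if tags.get(word, "").startswith(expected_tag):
            keep.append(word)
    return keep

# e.g. plausible_fills(
#     "By then, [REDACTED] individuals were under investigation.",
#     ["four", "seven", "several"])
```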
Random thought: using serifs to encode additional bits with a Reed-Solomon code would allow for, say, one bit per letter (either we use the serif or the sans-serif version of each glyph). Assuming Latin characters, it would allow for t/2 = 1/16 (6.25%) of the original text; a bit lower if we consider that letters like "o" do not have serifs, somewhat higher if we remove easily-guessable words like "the", "an" and so on.
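A bare-bones sketch of the embedding side of that thought (glyph rendering and the Reed-Solomon layer over the bitstream are left out; the boolean simply marks which glyph variant to render):

```python
def embed(text: str, bits: list) -> list:
    """Pair each alphabetic character with a serif/sans flag carrying one
    bit; non-letters (and letters once bits run out) default to sans."""
    out, i = [], 0
    for ch in text:
        if ch.isalpha() and i < len(bits):
            out.append((ch, bool(bits[i])))  # True -> render serif variant
            i += 1
        else:
            out.append((ch, False))
    return out
```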
That's a good idea, I might do a quick Show HN about that.
If that's the case, I think the easiest thing to do would be to transcribe the document and include annotations inline about where content was omitted.
e.g. Go from "XXXX people connected with.." to "[REDACTED] people connected with". So long as it's noted that redaction happened and one doesn't have the source document to glean things from, that would probably be sufficient to block the things the author of this article is doing.
What I'm wondering, though, is if this is permissible. Being able to see how much of a document was redacted gives us some important context:
1. We know we're looking at the "original" report. This might not be the case with an approved transcription of some sort.
2. It's easy for a layman to get a sense of how much information is being withheld.
Knowing what and how much was redacted seems like it's pretty important for maintaining trust and transparency with an organization that has legitimate reasons to withhold some amount of information.
At that point, just leak it anonymously somehow. If you do something this cute, everyone will know the person doing the redacting effectively leaked it intentionally. Why not just take a picture, strip the metadata, and pass it to a trusted journalist?
Wouldn't that change the non-redacted parts of the document? Or would there be a cover page with a story about a boy and his dog on every FOIA request?
But then you'd be giving away the exact length of the redacted text. In this case, you'd be narrowing the options down to "four", "five", and "nine" - reasonable, but longer text blocks would be more susceptible to analysis.
Monospace fonts only reduce this channel. If you want to redact properly, you'd need to replace every redacted section with a fixed-length placeholder.
Monospace tells you precisely how many characters were redacted. A proportional font with random kerning is better, but typographers would probably murder you in your sleep.
That always seemed bizarre to me. I get that printers, typewriters, and handwriting can all be identified, but why not just grab a common bic pen and hold it in your fist like a four year old, then print? I don't see how that could be matched to anything, and surely there's less room to screw up than when you're pasting out of a bunch of magazines and newspapers.
Agreed. I always figured this was a meme that made it easier for the police to track you down (just do some forensics on the magazine paper, fonts, etc., and locate someone with matching subscriptions).
"Hey boss, we got a subscriber to Mad Bomber Monthly, Torture World, and BYTE."
It's a classic packing problem: searching for phrases of exact pixel width, where each of an unknown number of letters contributes a different number of pixels.
Tough to say, but the most likely answer is that it's either "four" with extra space redacted, or the word "several", or maybe "seven", as that would be three more, and it's not a stretch to imagine there are three more in this administration who would have been under investigation.
I think it is also unlikely given the space that follows, which would read more naturally if it was name + title; but it could be, if the alignment was just off too much. (Judging from the beginning of the sentence, I think the article here does a better job of aligning than this guy did.)
If it is felt pen-redacted and then scanned, wouldn't there be some detectable difference between the parts of the paper with and without underlying ink?
Yes, if you have the original. However, the scans usually adjust the contrast so much that the saved image is neither grayscale nor colour; it's either black or white.
Here's what "four" and "some" look like: https://i.imgur.com/UNSpv6v.png .
Different letter widths and kerning make this identification possible.
I wonder how brute-forceable it is. Given a blank with a length of half a line, can I just throw 20-30 combinations of the alphabet and punctuation at it, figure out which combinations match, and then solve the anagram?
Actually, brute-forcing using words from a dictionary (English words plus names of the involved people (Americans, and obviously Russians)) would make it go even faster.
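A minimal version of that dictionary attack, with a made-up per-character width table standing in for real font metrics (kerning ignored, which is exactly the complication raised below):

```python
# Hypothetical advance widths in pixels; real values would be measured
# from the document's actual typeface.
CHAR_WIDTH = dict(zip("abcdefghijklmnopqrstuvwxyz ",
    [9, 10, 9, 10, 9, 6, 10, 10, 4, 4, 9, 4, 15, 10, 10, 10, 10, 6, 8, 6,
     10, 9, 14, 9, 9, 8, 5]))

def width(text):
    return sum(CHAR_WIDTH.get(c, 10) for c in text.lower())

def candidates(words, target_px, tol=2):
    """All dictionary words whose rendered width matches the redaction."""
    return [w for w in words if abs(width(w) - target_px) <= tol]

# e.g. candidates(["four", "five", "nine", "some", "several"], 32)
```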
In theory kerning differences would make letter ordering significant, so this becomes a permutation problem not a combination problem, and therefore much harder to brute force.
Is kerning also used across spaces? I.e., is the space between two words dependent on which letter the first word ends with and which letter the second word starts with?
If not, it still seems tractable using dictionary words.
Ok. Makes it much more difficult. Still, assuming grammatically valid sequences of dictionary words, kerning would be known both within and between words. These texts however probably contain lots of abbreviations, footnote symbols, numbers, brackets etc, that make it likely to be a lot harder than just regular prose from dictionary words.
It's been a while since my last discrete math class, so take this with a grain of salt, but in a combination you only care about selecting the correct members of a set, so order does not matter. In a permutation problem you care about selecting the correct members as well as their order, so it is a significantly larger solution space, which expands factorially.
Thanks for pointing that out.
So if they are going to leave the original layout (not padding or subtracting spacing, and so forth), then they should at a minimum use monospaced typefaces.
Exactly, these aren't randomly selected words for redaction. The fact that someone thought the information was worth hiding gives a lot of information regarding its possible contents.
Just FYI: the government became aware of this issue around 2010 and changed the policy on how redactions were done, moving from blackout to a white box with black edges, overlaid on the text with small size variations to prevent this attack. Old documents that have already been released, though, are likely to be ripe for this kind of analysis.
I'm guessing Congressional memo redactions, especially these days, aren't quite up to the standards of intelligence agencies that have to redact and release for FOIA on a daily basis.
A two-digit number (i.e., "13") would be too short for the redaction. A three- or four-digit number is highly unlikely (because it would be treated as a single conspiracy, not as separate investigations of individuals). Plus, the sentence that follows after the colon seems to be a list of the individuals, and its length points at a single-digit number.
I don't know if it's an official style somewhere or not, but I was always taught that 1-9 you spell out (one, two, three, ...) and 10+ you write as digits (10, 11, 12, 13, ...). With a few exceptions, like when two numbers are next to each other.
Interesting, but operating systems & graphics applications conform typefaces to different subpixel grids. 17pt Times in Pixelmator on a Mac is not 17pt Times in Photoshop on Windows, and certainly not 17pt Times on a scanned document printed on an undisclosed substrate by an unknown printer. Similarly, typeface tracking adds a significant margin of error to this interpretation method: metrics can vary by application, and some applications will automatically apply optical tracking adjustments for better aesthetics. It's for these reasons that I'm convinced that anything beyond trivial (in terms of level of effort) and obvious disclosures won't be made using this methodology, and that we can't universally trust the results.
Seems like if you truly care about hiding the redacted information, the only choice is to manually retype the document with the redacted text replaced with something that has no information content, like the literal text “REDACTED.”
I wonder whether or not there should be a rule on HN about linking to classified information. I am an extremely harsh critic of the American security state and the IC in general, but lower-level employees can be fired or even charged with crimes for viewing things they're not supposed to, and I don't think it's right to disregard the effect on the individual. Obviously, in this case, it's only a single word, but I've seen leaked docs here before.
Nothing has been leaked here. The source document has been publicly released.
Insufficient care has been taken in redacting parts of the document. Responsibility for that falls on whoever redacted the document, not on whoever was able to deduce the redacted content.
Ah. I wasn't aware that the source document had already been released. But whether or not someone could get in trouble for viewing an un-redacted doc is an open question afaict. The quality of the redaction probably doesn't matter.
I agree with you regarding this in general, maybe a tag would be useful, but in this case the president has declassified the source document. The attempts to guess what was redacted would not count as classified information.
Absolutely. This wasn't the best hill to plant this flag on but it's still something I think the community should be aware of. It was more applicable when people were linking directly to wikileaks dumps for thousands of indexed, TS docs.
If you’re extremely harsh on the IC in general, then you wouldn’t be saying this. Those people chose to get those clearances, and voluntarily agreed to ridiculous and insane terms like “reading things other people say I can’t is now grounds to put me in jail”.
Such people do not deserve special workarounds by the rest of society because of their terribly poor judgement.
Nah, I just differentiate between the thing and its components. Would we see it as some great victory toward disarming the surveillance state if someone clicked on a link and happened to get caught for it? That's a very "spray'n'pray" sort of approach which I think could do more damage than it's worth.
I have some old friends who work in the American intelligence community (CIA, NRO, etc.) and they've explained to me that you consent to a lot of things when you join the military and get clearance that don't apply to civilians. I asked why the Army/Navy completely block access to WikiLeaks, and I was told it's because people can "get in serious trouble" for looking at things that they don't have clearance for, so blocking makes it much less likely there'll be a mistake when sites that host that content are all over the news.
Leaking classified material doesn't make it unclassified; leaked documents are still classified material and must be treated as such if you have a security clearance - but only if you had/have a clearance. That means not only not accessing it if you don't have the proper clearance and need-to-know, but also never transmitting it over unclassified channels or downloading or storing it on an unclassified computer.
Since it's never authorized to access classified information from an unclassified network, it makes sense to block access to classified materials from unclassified networks.
I understand your concerns, but we can't not publicize the information; that would violate the public interest. These leaks have little effect if they are not broadly distributed.
However, out of concern for individual workers in the bureaucracy, I would support a tag on such stories to warn them before clicking.
EDIT: On reading the other comments here, I don't think even a tag is warranted for this story as no leaking has occurred.
I would prefer something like (LEAK) or (CLASSIFIED LEAK). NSFW means that many ordinary people that aren't subject to the logical contradictions of the classification system won't look because they think it's porn or something.
NSFW isn't enough: if you possess a security clearance, it's not safe to view at home either, as 1) you can't access classified information if you don't have a need-to-know and 2) you aren't authorized to access classified information from an unclassified computer.
Matt Tait (infosec researcher at the Strauss Center at UT Austin) had some insight into this: https://twitter.com/pwnallthethings/status/96752319618133606... – it's worth reading his first few comments; he has some good thoughts on using fixed-width fonts in redacted documents.
So, given the very small list of possible names associated with the subject of this paragraph, I should think the number of permutations would be in the low hundreds. I’d be willing to bet the larger block falls to this analysis by morning in Washington. (Not that that is necessarily the most important block, but it probably has the most bounded scope.)
I am working on automating this process and trying to find similar hits for the remaining redactions.
I was thinking that too, but it characterizes Carter Page as a "former campaign foreign policy advisor". The other names may have similar designations, making that a hopeless game.
We do know, from an unredacted footnote, that one of them was Michael Flynn. So that makes 2/4.
No points for the others either: Manafort and Gates have also been indicted publicly, and names and titles for each probably fit in the redacted section.
Long thought this should be possible... I am very impressed you figured out a way to do it.
You could iterate the strategy further by reconstructing a full font from the existing text, identifying the spacing for each letter/character pair, calculating word-wrap lengths, and getting it even closer.
It’s probably most effective to identify the software and font that produced the original; exact positioning of text is complex: you’d need to identify kerning pairs, spacing rules within lines, sub-pixel positioning (where appropriate), etc.
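If you can identify the original typeface, a library like Pillow can measure rendered string widths for you (sketch only; the font path and point size here are assumptions):

```python
from PIL import ImageFont

# Substitute the typeface and size identified from the source document.
font = ImageFont.truetype("TimesNewRoman.ttf", 17)

def fits(candidate: str, redaction_px: float, tol: float = 1.0) -> bool:
    """True if the candidate string's rendered width matches the
    redaction's measured pixel width within a small tolerance."""
    return abs(font.getlength(candidate) - redaction_px) <= tol
```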