I once implemented a steganography system to manipulate the ordering of class files in a Java jar file (they retain their order therein).
The "distance from alphabetically sorted" was calculable as a BigInteger. That BigInteger was encodable as a text message. Hence, a text message could be hidden in the ordering of the class files (assuming there's some reasonable number of class files, and there usually is).
I kept wondering if the author applied any of the text/image steganographic techniques discussed to the paper itself, in a self-referential sort of way. Had I been the author, I'd have felt obligated to do so.
> One proposed solution by Judge [14] uses spelling errors to hide data, for example spelling "is" as "iz". A correctly spelled word indicated a zero, a incorrectly spelled word a 1.
And one can see some spelling mistakes or variations across the document... (For instance: "This attack involves the analysis of know patterns the correspond to hidden information" or "being able to here the data hidden in audio")
Good catch. Also, from the sentences you've just quoted: "a incorrectly" instead of "an incorrectly"... Now, is the author doing these misspellings deliberately?
I have access to some of Apple's secret documentation and they watermark your name and email address over all the PDFs. I often wondered if they do something like this as well, just in case.
Only using synonyms is pretty limited though. You could go much further:
1. In PDF documents use layouts that differ by only a couple of pixels.
2. Change the placement of figures.
3. Change brackets to dashes (like this) - or like this.
4. Add double spaces in places.
5. Adjust the hyphenation at the end of long lines.
6. Use Unicode look-alike characters (see the sketch after this list).
7. Use different Unicode variations of dashes and quotation marks.
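For example, a toy Python sketch of technique 6, hiding one bit per eligible letter by swapping it for its Cyrillic look-alike (the mapping here is a small illustrative subset, not a full homoglyph table):

    # Latin -> Cyrillic homoglyphs (illustrative subset; many more exist).
    HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441"}
    REVERSE = {v: k for k, v in HOMOGLYPHS.items()}

    def embed(text, bits):
        """Hide bits: 1 = swap an eligible letter for its look-alike, 0 = keep it."""
        bits = iter(bits)
        out = []
        for ch in text:
            if ch in HOMOGLYPHS:
                b = next(bits, None)
                out.append(HOMOGLYPHS[ch] if b else ch)  # None counts as 0
            else:
                out.append(ch)
        return "".join(out)

    def extract(text):
        """Recover the bit stream from the eligible positions."""
        return [1 if ch in REVERSE else 0
                for ch in text if ch in REVERSE or ch in HOMOGLYPHS]

    stego = embed("a document nobody can trace", [1, 0, 1, 1])
    assert extract(stego)[:4] == [1, 0, 1, 1]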
If you only had a single document and wanted to leak it while being certain that your name wouldn't be leaked too, the only way I can see to do it would be to rewrite it in your own words.
Out of curiosity, is this watermarking you speak of actual visual watermarking, or is it more steganographic, like the techniques you mention? (If you may share in general terms, of course...)
This is a good idea. I wonder if deep learning would be useful for testing it rigorously, this way:
1. Divide a set of documents from the same person, applying the steganography to half. Represent each word by a number.
2. Take half of this data and use it to train your deep-learning algorithm to categorize documents as clear/stego'd. Then test how well it can detect stego in the other half.
The inputs to the deep-learning model should be as follows: num_of_word_0, synonym_number(num_of_word_0), ..., num_of_word_n, synonym_number(num_of_word_n)
Where synonym_number(word) is an index into the synset: if we start with home-house-abode from WordNet as synonyms, then home=1, house=2, abode=3.
This way we give the deep learner information about your synonym-usage statistics.
3. Run the algorithm.
Not too complex, and there's a decent likelihood that if there are statistical patterns, the deep-learner will find them.
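For concreteness, a Python sketch of the feature encoding from steps 1-2, with a toy synonym table standing in for WordNet (the table and sentence are made up):

    # Toy synset table standing in for WordNet lookups.
    SYNSETS = {
        "home": ("home", "house", "abode"),
        "house": ("home", "house", "abode"),
        "abode": ("home", "house", "abode"),
        "big": ("big", "large", "huge"),
        "large": ("big", "large", "huge"),
        "huge": ("big", "large", "huge"),
    }

    VOCAB = {w: i for i, w in enumerate(sorted(SYNSETS))}  # word -> num_of_word

    def features(doc):
        """Turn a document into (num_of_word, synonym_number) pairs."""
        pairs = []
        for word in doc.lower().split():
            if word in SYNSETS:
                pairs.append((VOCAB[word], SYNSETS[word].index(word) + 1))
        return pairs

    # "abode" is synonym 3 of its synset, "big" is synonym 1 of its own.
    print(features("my abode is big"))   # [(0, 3), (1, 1)]

These pairs would then be flattened into the input vector for the classifier.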
No, that is something unrelated. The way you would have to detect this is by obtaining the document from multiple sources and checking them for differences.
I've wondered about doing this before for copy protection: mixing the order of bullet points, and so on. You can't stop copies, but you can potentially see who made them.
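As a toy Python sketch of that (recipient names hypothetical): issue each recipient a distinct bullet ordering, then match a leaked copy back to whoever received that ordering.

    import itertools

    BULLETS = ["term A", "term B", "term C", "term D"]

    # Issue each recipient a distinct ordering of the same bullet points.
    RECIPIENTS = dict(zip(["alice", "bob", "carol"],
                          itertools.permutations(BULLETS)))

    def who_leaked(leaked_bullets):
        """Match the bullet order in a leaked copy back to its recipient."""
        for name, order in RECIPIENTS.items():
            if tuple(leaked_bullets) == order:
                return name
        return None

    print(who_leaked(["term A", "term B", "term D", "term C"]))  # bob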
>I've wondered about doing this before for copy protection
Something similar -- using functionally-equivalent machine language instructions instead of synonyms -- was (is?) used by the shareware A86 assembler[1] to detect unregistered use.
From the A86 manual:
A86 takes advantage of situations in which more than one set
of opcodes can be generated for the same instruction. [...]
A86 adopts an unusual mix of choices in such situations.
This creates a code-generation "footprint" that ... will
enable me to tell ... if a non-trivial object file has
been produced by A86.
First of all, forensic linguistics is much, much less powerful than it's made out to be. In particular, there's no really good way to get a confidence estimate out of the prediction. All you can get is a likelihood ratio between N different suspects. You can't really get "definitely a match" vs. "I don't know".
Anyway. The best solution would be to adopt some constraint: e.g., being forced to write in haikus, in "upgoer 5"-style Basic English, or only in sentences that Google Translate stably round-trips English<->French, etc.
The accuracy of a forensic linguistic algorithm trained on normal text and run on the stylistically constrained text is completely unknown. Hopefully this evidence would be inadmissible even by forensic science standards. Then again, maybe not.
Not really. Authorship attribution methods typically rely on function words (the most frequent words), which cannot be readily substituted, as that would require rewriting the whole sentence.
I don't think you can be. You could train authorship attribution on many kinds of features. But checking against common methods would probably go a long way in avoiding detection.