
A Synonym-Substitution Based Algorithm for Text Steganography (2012) [pdf] - n-s-f
http://www.cs.bham.ac.uk/~nagarajs/papers/J.gardiner-MScReport.pdf
======
zinxq
Laughed out loud reading that title. So clever.

I once implemented a steganography system that manipulated the ordering of class
files in a Java jar file (the jar retains their order).

The "distance from alphabetically sorted" was calculable as a BigInteger.
That BigInteger could be converted to and from a text message. Hence, a text
message could be hidden in the ordering of the class files (assuming there were
a reasonable number of class files, and there usually are).
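
A minimal sketch of how such an ordering channel could work, assuming the "distance from alphabetically sorted" is the permutation's rank in the factorial number system. The original was written in Java with BigInteger and its exact encoding is not described here, so the function names `encode` and `decode` and the 0x01 length-guard byte are illustrative assumptions, not the original design.

```python
from math import factorial

def encode(names, message: bytes):
    """Return `names` reordered so the permutation rank equals the message."""
    items = sorted(names)  # names are assumed to be unique
    n = len(items)
    # Assumption: prefix 0x01 so leading zero bytes survive the round trip.
    value = int.from_bytes(b"\x01" + message, "big")
    if value >= factorial(n):
        raise ValueError("not enough files to hold this message")
    ordered = []
    for i in range(n - 1, -1, -1):
        # Each factorial-base digit picks one of the remaining sorted names.
        index, value = divmod(value, factorial(i))
        ordered.append(items.pop(index))
    return ordered

def decode(ordered):
    """Recover the message bytes from the order of the names."""
    items = sorted(ordered)
    n = len(ordered)
    value = 0
    for i, name in enumerate(ordered):
        index = items.index(name)
        items.pop(index)
        value += index * factorial(n - 1 - i)
    raw = value.to_bytes((value.bit_length() + 7) // 8, "big")
    return raw[1:]  # drop the guard byte
```

With n files the ordering can carry a little under log2(n!) bits, so 30 class files hold roughly 13 bytes and a few hundred hold a short paragraph.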

~~~
specialist
Clever. Inspired by this?

[https://en.wikipedia.org/wiki/Acrostic](https://en.wikipedia.org/wiki/Acrostic)

------
bsaunder
I kept wondering if the author applied any of the text/image steganographic
techniques discussed to the paper itself, in a self referential sort of way.
Had I been the author, I'd have felt obligated to do so.

------
vinchuco
From the document:

> One proposed solution by Judge [14] uses spelling errors to hide data, for
> example spelling ”is” as ”iz”. A correctly spelled word indicated a zero, a
> incorrectly spelled word a 1.

And one can see some spelling mistakes or variations across the document...
(For instance: "This attack involves the analysis of know patterns the
correspond to hidden information" or "being able to here the data hidden in
audio")

~~~
ChrisGranger
Good catch. Also, from the sentences you've just quoted: "a incorrectly"
instead of "an incorrectly"... Now, is the author doing these misspellings
deliberately?

------
IshKebab
I have access to some of Apple's secret documentation and they watermark your
name and email address over all the PDFs. I often wondered if they do
something like this as well, just in case.

Only using synonyms is pretty limited though. You could go much further:

1\. In PDF documents use layouts that differ by only a couple of pixels.
2\. Change the placement of figures.
3\. Change brackets to dashes (like this) - or like this.
4\. Add double spaces in places.
5\. Adjust the hyphenation at the end of long lines.
6\. Use unicode look-a-like characters.
7\. Use different unicode variations of dashes and quotation marks.
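
As a rough illustration of item 4, here is a minimal sketch that hides a recipient ID in word spacing, assuming one bit per gap (single space = 0, double space = 1). The function names `embed_bits` and `extract_bits` are made up for this example, and nothing here reflects how any real vendor marks documents.

```python
import re

def embed_bits(text: str, bits: str) -> str:
    """Hide `bits` (a string of '0'/'1') in the gaps between words."""
    words = text.split()  # note: this also flattens newlines into spaces
    if len(bits) > len(words) - 1:
        raise ValueError("text has too few word gaps for these bits")
    out = [words[0]]
    for i, word in enumerate(words[1:]):
        # Gaps beyond the end of `bits` stay as single spaces.
        gap = "  " if i < len(bits) and bits[i] == "1" else " "
        out.append(gap + word)
    return "".join(out)

def extract_bits(text: str, count: int) -> str:
    """Read back the first `count` bits from the word gaps."""
    gaps = re.findall(r" +", text)
    return "".join("1" if len(gap) > 1 else "0" for gap in gaps[:count])
```

A mark like this is fragile: anything that normalizes whitespace (HTML rendering, copy and paste through some editors) erases it, which is one reason to combine several of the techniques listed above.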

If you only had a single document and wanted to leak it while being certain
that your name wouldn't be leaked too, the only way I can see to do it would
be to rewrite it in your own words.

~~~
sbhere
Out of curiosity, is the watermarking you speak of actual visual
watermarking, or is it more steganographic, like the techniques you mention?
(If you may share in general terms, of course...)

~~~
IshKebab
It's visual (large grey text across the page at an angle).

------
minthd
This is a good idea. I wonder if deep learning could be used to test it
rigorously, like this:

1\. Divide a set of documents from the same person and apply steganography to
half. Represent each word by a number.

2\. Take half of this data and use it to train your deep-learning algorithm to
categorize documents as clear or stego'd. Then test how well it can detect
stego in the other half.

The inputs to the deep learner should be as follows: num_of_word_0,
synonym_number(num_of_word_0), .., num_of_word_n,
synonym_number(num_of_word_n)

Where synonym_number(word) is the word's position among its synonyms: if we
start with home-house-abode from WordNet as synonyms, home=1, house=2, abode=3.

This way we give the deep-learner info about your synonym use stats.

3\. Run the algorithm.

Not too complex, and there's a decent likelihood that if there are statistical
patterns, the deep-learner will find them.
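
A minimal sketch of the input encoding described above, covering only the feature extraction and not the classifier. It assumes a toy synonym table in place of WordNet, word numbers assigned in order of first appearance, and 0 as the synonym number for words with no listed synonyms; `SYNSETS` and `build_features` are hypothetical names.

```python
# Toy stand-in for WordNet: each list is one group of interchangeable words.
SYNSETS = [["home", "house", "abode"]]

def build_features(words, synsets=SYNSETS):
    """Turn a word list into (word_number, synonym_number) pairs."""
    # Position within the synonym group, starting at 1 (home=1, abode=3).
    synonym_number = {
        word: position + 1
        for group in synsets
        for position, word in enumerate(group)
    }
    vocab = {}  # word -> number, assigned on first appearance
    features = []
    for word in words:
        word_id = vocab.setdefault(word, len(vocab))
        # Assumption: words outside every group get synonym number 0.
        features.append((word_id, synonym_number.get(word, 0)))
    return features

print(build_features("my abode is my home".split()))
# → [(0, 0), (1, 3), (2, 0), (0, 0), (3, 1)]
```

The pairs would then be flattened into the input vector for whatever classifier is trained on the clear and stego'd halves.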

------
MPSimmons
It seems like this would be detectable using a technique talked about by Bruce
Schneier in 2011
([https://www.schneier.com/blog/archives/2011/08/identifying_p...](https://www.schneier.com/blog/archives/2011/08/identifying_peo_2.html))
called forensic linguistics
([https://en.wikipedia.org/wiki/Forensic_linguistics](https://en.wikipedia.org/wiki/Forensic_linguistics))

~~~
IshKebab
No, that is something unrelated. The way you would have to detect this is by
obtaining the document from multiple sources and checking them for
differences.
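
A minimal sketch of that comparison, assuming you hold two plain-text copies and only care about word-level substitutions. It uses `difflib.SequenceMatcher` from the Python standard library; `differing_words` is a name made up for this example.

```python
from difflib import SequenceMatcher

def differing_words(copy_a: str, copy_b: str):
    """List the word runs that differ between two copies of a document."""
    a, b = copy_a.split(), copy_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (" ".join(a[i1:i2]), " ".join(b[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]

print(differing_words("he went to his home quickly",
                      "he went to his abode quickly"))
# → [('home', 'abode')]
```

Because the text is split on whitespace, this would miss marks hidden in spacing or layout; those would need a character-level or rendered-page comparison.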

------
peteretep
I've wondered about doing this before for copy protection; mixing the order of
bullet points, and so on. You can't stop copies, but you can potentially see
who made them.

~~~
edmccard
>I've wondered about doing this before for copy protection

Something similar -- using functionally-equivalent machine language
instructions instead of synonyms -- was (is?) used by the shareware A86
assembler[1] to detect unregistered use.

From the A86 manual:

    
    
        A86 takes advantage of situations in which more than one set
        of opcodes can be generated for the same instruction. [...]
        A86 adopts an unusual mix of choices in such situations.
        This creates a code-generation "footprint" that ... will
        enable me to tell ... if a non-trivial object file has
        been produced by A86.
    

[1] [http://eji.com/a86/](http://eji.com/a86/)
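
To make the idea concrete, here is a minimal sketch of choosing between equivalent encodings of the same instruction. The byte pairs are standard alternative encodings of 16-bit x86 register-to-register instructions, but the bit-per-instruction scheme and the names `ENCODINGS`, `assemble`, and `read_footprint` are assumptions for illustration; the manual does not say which choices A86 actually makes.

```python
# Each instruction has two byte encodings that do exactly the same thing:
# one with the "r/m, reg" opcode form and one with the "reg, r/m" form.
ENCODINGS = {
    "mov ax, bx": (b"\x89\xd8", b"\x8b\xc3"),
    "add ax, bx": (b"\x01\xd8", b"\x03\xc3"),
    "xor ax, ax": (b"\x31\xc0", b"\x33\xc0"),
}

def assemble(instructions, bits):
    """Pick encoding 0 or 1 for each instruction according to `bits`."""
    return [ENCODINGS[ins][int(bit)] for ins, bit in zip(instructions, bits)]

def read_footprint(instructions, encoded):
    """Recover which encoding was chosen for each instruction."""
    return "".join(
        str(ENCODINGS[ins].index(code))
        for ins, code in zip(instructions, encoded)
    )
```

A fixed, unusual pattern of such choices acts as a signature of the tool; varying the pattern per customer would turn it into a per-copy mark.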

------
sugarfactory
This might also be useful for combating attempts to reveal authors' identities
using forensic linguistics.

~~~
irremediable
How would you ensure that you replace the right parts?

~~~
andreasvc
I don't think you can. You could train authorship attribution on many kinds
of features. But checking against common methods would probably go a long way
toward avoiding detection.

