I once implemented a steganography system to manipulate the ordering of class files in a Java jar file (they retain their order therein).
The "distance from alphabetically sorted" was calculable as a BigInteger. That BigInteger was encodable as a text message. Hence, a text message could be hidden in the ordering of the class files (assuming there's some reasonable number of class files, and there usually is).
I kept wondering if the author applied any of the text/image steganographic techniques discussed to the paper itself, in a self-referential sort of way. Had I been the author, I'd have felt obligated to do so.
> One proposed solution by Judge [14] uses spelling errors to hide data, for example spelling "is" as "iz". A correctly spelled word indicated a zero, a incorrectly spelled word a 1.
And one can see some spelling mistakes or variations across the document... (For instance: "This attack involves the analysis of know patterns the correspond to hidden information" or "being able to here the data hidden in audio")
Good catch. Also, from the sentences you've just quoted: "a incorrectly" instead of "an incorrectly"... Now, is the author doing these misspellings deliberately?
I have access to some of Apple's secret documentation and they watermark your name and email address over all the PDFs. I often wondered if they do something like this as well, just in case.
Only using synonyms is pretty limited though. You could go much further:
1. In PDF documents use layouts that differ by only a couple of pixels.
2. Change the placement of figures.
3. Change brackets to dashes (like this) - or like this.
4. Add double spaces in places.
5. Adjust the hyphenation at the end of long lines.
6. Use Unicode look-alike characters (see the sketch after this list).
7. Use different Unicode variations of dashes and quotation marks.
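For example, a toy Python sketch of technique 6, hiding one bit per eligible letter by swapping it for its Cyrillic look-alike (the mapping here is a small illustrative subset, not a full homoglyph table):

    # Latin -> Cyrillic homoglyphs (illustrative subset; many more exist).
    HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e", "c": "\u0441"}
    REVERSE = {v: k for k, v in HOMOGLYPHS.items()}

    def embed(text, bits):
        """Hide bits: 1 = swap an eligible letter for its look-alike, 0 = keep it."""
        bits = iter(bits)
        out = []
        for ch in text:
            if ch in HOMOGLYPHS:
                b = next(bits, None)
                out.append(HOMOGLYPHS[ch] if b else ch)  # None counts as 0
            else:
                out.append(ch)
        return "".join(out)

    def extract(text):
        """Recover the bit stream from the eligible positions."""
        return [1 if ch in REVERSE else 0
                for ch in text if ch in REVERSE or ch in HOMOGLYPHS]

    stego = embed("a document nobody can trace", [1, 0, 1, 1])
    assert extract(stego)[:4] == [1, 0, 1, 1]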
If you only had a single document and wanted to leak it while being certain that your name wouldn't be leaked too, the only way I can see to do it would be to rewrite it in your own words.
Out of curiosity, is this watermarking you speak of actual visual watermarking, or is it more steganographic, like the techniques you mention? (If you may share in general terms, of course...)
This is a good idea. I wonder if deep learning would be useful for testing it rigorously, this way:
1. Divide a set of documents from the same person, applying the steganography to half. Represent each word by a number.
2. Take half of this data and use it to train your deep-learning algorithm to categorize documents as clear/stego'd. Then test how well it can detect stego in the other half.
The inputs to the deep-learning model should be as follows: num_of_word_0, synonym_number(num_of_word_0), ..., num_of_word_n, synonym_number(num_of_word_n)
Where synonym_number(word) is an index into the synset: if we start with home-house-abode from WordNet as synonyms, then home=1, house=2, abode=3.
This way we give the deep learner information about your synonym-usage statistics.
3. Run the algorithm.
Not too complex, and there's a decent likelihood that if there are statistical patterns, the deep-learner will find them.
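For concreteness, a Python sketch of the feature encoding from steps 1-2, with a toy synonym table standing in for WordNet (the table and sentence are made up):

    # Toy synset table standing in for WordNet lookups.
    SYNSETS = {
        "home": ("home", "house", "abode"),
        "house": ("home", "house", "abode"),
        "abode": ("home", "house", "abode"),
        "big": ("big", "large", "huge"),
        "large": ("big", "large", "huge"),
        "huge": ("big", "large", "huge"),
    }

    VOCAB = {w: i for i, w in enumerate(sorted(SYNSETS))}  # word -> num_of_word

    def features(doc):
        """Turn a document into (num_of_word, synonym_number) pairs."""
        pairs = []
        for word in doc.lower().split():
            if word in SYNSETS:
                pairs.append((VOCAB[word], SYNSETS[word].index(word) + 1))
        return pairs

    # "abode" is synonym 3 of its synset, "big" is synonym 1 of its own.
    print(features("my abode is big"))   # [(0, 3), (1, 1)]

These pairs would then be flattened into the input vector for the classifier.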
No, that is something unrelated. The way you would have to detect this is by obtaining the document from multiple sources and checking them for differences.
I've wondered about doing this before for copy protection: mixing the order of bullet points, and so on. You can't stop copies, but you can potentially see who made them.
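As a toy Python sketch of that (recipient names hypothetical): issue each recipient a distinct bullet ordering, then match a leaked copy back to whoever received that ordering.

    import itertools

    BULLETS = ["term A", "term B", "term C", "term D"]

    # Issue each recipient a distinct ordering of the same bullet points.
    RECIPIENTS = dict(zip(["alice", "bob", "carol"],
                          itertools.permutations(BULLETS)))

    def who_leaked(leaked_bullets):
        """Match the bullet order in a leaked copy back to its recipient."""
        for name, order in RECIPIENTS.items():
            if tuple(leaked_bullets) == order:
                return name
        return None

    print(who_leaked(["term A", "term B", "term D", "term C"]))  # bob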
>I've wondered about doing this before for copy protection
Something similar -- using functionally-equivalent machine language instructions instead of synonyms -- was (is?) used by the shareware A86 assembler[1] to detect unregistered use.
From the A86 manual:
A86 takes advantage of situations in which more than one set
of opcodes can be generated for the same instruction. [...]
A86 adopts an unusual mix of choices in such situations.
This creates a code-generation "footprint" that ... will
enable me to tell ... if a non-trivial object file has
been produced by A86.
First of all, forensic linguistics is much, much less powerful than it's made out to be. In particular, there's no really good way to get a confidence estimate out of the prediction. All you can get is a likelihood ratio between N different suspects. You can't really get "definitely a match" vs. "I don't know".
Anyway. The best solution would be to adopt some constraint: e.g., being forced to write in haikus, in "upgoer 5"-style Basic English, or only in sentences that Google Translate stably round-trips English<->French, etc.
The accuracy of a forensic linguistic algorithm trained on normal text and run on the stylistically constrained text is completely unknown. Hopefully this evidence would be inadmissible even by forensic science standards. Then again, maybe not.
Not really. Authorship attribution methods typically rely on function words (the most frequent words), which cannot be readily substituted, as that would require rewriting the whole sentence.
I don't think you can be. You could train authorship attribution on many kinds of features. But checking against common methods would probably go a long way in avoiding detection.