
Stegasuras: Neural Linguistic Steganography - kcazyz
https://steganography.live/
======
ashton314
I think that this kind of automated steganography demonstrates that legislation
requiring ~~mandated backdoors~~, ahem, _"responsible encryption"_, is
completely nonsensical: we could encrypt our messages with something powerful
and safe, then encode it with something like this. Boom. Plausible deniability
that you were sending anything other than plaintext. Sure, the plaintext is
long-winded and verbose, and maybe a little nonsensical, but would a computer
be able to reliably distinguish between encrypted-then-steganographically-
concealed text and "real" plaintext?

Anyway, really cool work.

~~~
8bitsrule
>Nonsensical

E.g., one sentence of the 'cover text' I got read:

"Then, on September 8, 1816, Sherman surrounded Army headquarters and killed
three emissaries and surrendered with large supplies and safety pins."

Sherman wasn't born until 1820.

~~~
sswaner
But if you use a different LM context that is not so easily fact-checked, then
you won't realize that it is nonsensical.

I used this LM Context: "This message contains confidential information and is
intended only for the individual named. If you are not the named addressee you
should not disseminate, distribute or copy this E-mail. Please notify the
sender immediately by E-mail if you have received this E-mail by mistake and
delete this E-mail from your system."

To encrypt this fake but real-looking PII data:
"Jeffery,Gourlay,2/2/55,495-24-1236"

And produced this output cover text: "Disclaimer: An individual may be
reminded of this email at any time by telephone from a different address, e.g.
e-mail. However, online resources and other media (e.g. e-mailed stories,
articles, etc.) that include original content, are not covered under this
policy. The information contained herein"

Malware or a bad actor could transmit this past enterprise InfoSec controls
all day without easy detection.

~~~
jsilence
Poetry might also make for a good LM context. Nonsensicalness can be explained
as artistic expression.

------
cs702
Very cool!

If I understand correctly, this approach can generate steganographic output
that is _virtually indistinguishable_ from the output generated by the
language model used. Quoting from section 4.3 of the paper:

 _" Most striking, arithmetic coding with the unmodulated language model
induces a q distribution with a KL of 4e-8 nats. This indicates that,
consistent with theory (Sallee, 2004), arithmetic coding enables generative
steganography matching the exact distribution of the language model used."_[a]

Regardless of theory, I can't help but think, holy cow, the K-L divergence
between the distribution of tokens in the steganographic output versus the
output of a GPT-2 model is a minuscule 0.00000004 nats!
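(For a feel for the units: KL divergence in nats is just the expected
log-ratio of the two distributions under the natural log. A minimal Python
sketch, with a toy three-token vocabulary of my own invention:)

    import math

    def kl_nats(p, q):
        """KL(p || q) in nats: expectation under p of ln(p/q)."""
        return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

    p = [0.5, 0.3, 0.2]    # toy language-model next-token distribution
    q = [0.49, 0.31, 0.2]  # slightly perturbed encoder distribution
    print(kl_nats(p, q))   # ~2.6e-4 nats -- already tiny; the paper reports ~4e-8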

Did I understand correctly? Given that language models are getting even larger
and the distribution of their output is getting harder and harder to
distinguish from that of human output[b], I find this incredibly powerful.

Anyway, congratulations and thank you for sharing on HN!

[a] [https://arxiv.org/abs/1909.01496](https://arxiv.org/abs/1909.01496)

[b] [https://nv-adlr.github.io/MegatronLM](https://nv-adlr.github.io/MegatronLM)

------
ebg13
To the people wondering how to make it work...

On [https://steganography.live/encrypt](https://steganography.live/encrypt),
leave the LM context intact and type your secret message into the secret
message box. Then click Encrypt at the bottom.

Select the output cover text and copy it to your clipboard.

Now go to
[https://steganography.live/decrypt](https://steganography.live/decrypt) and
paste the output cover text in and click Decrypt to recover your secret
message.

------
yellowapple
The steganographic output with the default LM context would make for some
excellent alternate-historical writing prompts:

----

On January 25, 1798, Washington was elected President of the Confederate
States of America. He later led the United States in the battle of Fort
Sumter, North Carolina, in which he secured the passage of the Union Army,
defeated the enemy and secured its last major victory. Washington was awarded
the Order of the Pacific on December 27, 1798. His wife, Bea, was born in 1833
and settled in the Virginia river valley at East Mary Street, Bowersville. In
1855, Bea had married. Shortly thereafter, General John F. Washington was
elected Governor of Virginia and commissioned to serve a year in captivity. He
became enthralled with the Union and persuaded Henry P. Morgan, to meet his
need for volunteers. Washington volunteered with the Federal troops using a
French silver coin and with their aid the Americans located and captured the
greatest concentration of troops there. After her capture, Bea was sent to
Washington's camp in Richmond, Richmond and had a farewell ring to the
American flag. It was this ring that Washington had set aside for himself with
his death and a letter from his wife, who was

----

EDIT: decrypting that message twice (i.e. decrypting the result of decrypting
the above message) apparently "works" and produces some interesting results.
It gets even more wacky as I continue decrypting more and more outputs.

------
bajie
Similar to a project I did in college using Markov chains instead. Located at
[http://cellprocessing-1329.appspot.com/stenography](http://cellprocessing-1329.appspot.com/stenography)

Type the text to encode with in the top box and the text to encode in the
bottom box, then press the button at the bottom. Easiest is to just click one
of the presets on the right, like Gettysburg Address or Apologies by Plato, to
fill it with a preset text, but if you fill it in yourself you need some words
that follow other words for it to work, at a minimum something like "a a b a"
(see the sketch below).
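
For anyone curious, here's a minimal toy reconstruction of the idea in Python
(my own sketch, not the code behind the site): whenever the current word has
at least two possible successors in the chain, the choice of successor spends
one secret bit; forced moves carry no information.

    from collections import defaultdict

    def build_chain(words):
        """Map each word to the sorted list of distinct words that follow it."""
        chain = defaultdict(set)
        for a, b in zip(words, words[1:]):
            chain[a].add(b)
        return {w: sorted(s) for w, s in chain.items()}

    def encode_bits(bits, chain, start):
        """Spend one bit per branching word: 0 -> first successor, 1 -> second."""
        out, word, i = [start], start, 0
        while i < len(bits):
            successors = chain[word]
            if len(successors) >= 2:
                word = successors[bits[i]]
                i += 1
            else:
                word = successors[0]  # forced move: carries no information
            out.append(word)
        return " ".join(out)

    def decode_bits(text, chain):
        """Recover the bits by replaying which successor was chosen."""
        words = text.split()
        return [chain[a].index(b)
                for a, b in zip(words, words[1:]) if len(chain[a]) >= 2]

    chain = build_chain("a a b a".split())   # the minimal corpus from above
    cover = encode_bits([1, 0, 0], chain, "a")
    print(cover)                      # "a b a a a"
    print(decode_bits(cover, chain))  # [1, 0, 0]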

------
Bootwizard
The default text and settings didn't do anything, so I'm not exactly sure what
I should be expecting from this site. Did this work for anyone else?

Edit: I can't get it to spit out anything no matter what I do.

~~~
joefkelley
Nope, doesn't work at all for me either.

~~~
ixtli
Type a message into the second box (the one that says "secret text") in order
to get the cover text, which you can then put into the decrypt page.

------
gajomi
FYI, you need to type something into the "Secret message" box and click
"Encrypt" using the prepopulated "LM context".

Both the context and the secret message change the encrypted output.

------
pmoriarty
This reminds me of mimic functions:

[https://en.wikipedia.org/wiki/Mimic_function](https://en.wikipedia.org/wiki/Mimic_function)

------
nullc
I couldn't figure out how to combine this kind of model with wet paper codes,
which is too bad since wet paper codes are really the only known way to resist
an attacker with an equally good statistical model.

The closest I came was putting the text in a gray-coded word-per-token form,
then using GPT-2 as the error metric in the encoder, but the resulting bitrate
was very low.
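
(For reference, the Gray-coding step is just the standard binary-reflected
transform over word indices; a Python sketch below. The wet-paper-code part is
the hard bit and isn't shown.)

    def to_gray(n: int) -> int:
        """Binary-reflected Gray code: adjacent integers differ in one bit."""
        return n ^ (n >> 1)

    def from_gray(g: int) -> int:
        """Invert by XOR-ing in successively shifted copies."""
        n = g
        while g := g >> 1:
            n ^= g
        return n

    print([to_gray(i) for i in range(8)])  # [0, 1, 3, 2, 6, 7, 5, 4]
    assert all(from_gray(to_gray(i)) == i for i in range(1024))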

------
rw
Stegasuras is convincing work and the quality looks excellent.

I wrote a steganographic tool in this same spirit back in 2011, called
Plainsight.

Back then, we didn't have deep learning, and the "Imagenet moment for NLP" had
yet to arrive.

My Python code, with examples, is here:
[https://github.com/rw/plainsight](https://github.com/rw/plainsight)

Unlike the OP, my Plainsight algorithm is 100% invertible by construction, and
accepts binary input. (I verified the inversion process with "roundtrip
fuzzing", a technique I still use today.)

Plainsight uses each _bit_ of the input message to generate tokens. Bits are
used to decide how to traverse a Huffman-style n-gram tree, weighted by
frequency. This tree of n-grams is the model used in both the encoding and
decoding steps. The drawbacks to my method are that the output 1) can be
verbose and 2) does not convince a human that it's plausible, except for short
messages.
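
Here's a toy reconstruction of that bit-consuming Huffman walk in Python (my
own sketch, not the actual Plainsight code): frequent continuations sit near
the root of the tree, so they cost fewer secret bits to emit.

    import heapq

    def huffman_tree(counts):
        """Build a Huffman tree from token counts; leaves are tokens,
        internal nodes are (left, right) pairs."""
        heap = [(c, i, tok) for i, (tok, c) in enumerate(counts.items())]
        heapq.heapify(heap)
        next_id = len(heap)
        while len(heap) > 1:
            c1, _, left = heapq.heappop(heap)
            c2, _, right = heapq.heappop(heap)
            heapq.heappush(heap, (c1 + c2, next_id, (left, right)))
            next_id += 1
        return heap[0][2]

    def emit_token(bits, tree):
        """Consume secret bits to walk the tree (0 = left, 1 = right)
        until a leaf token is reached."""
        node = tree
        while isinstance(node, tuple):
            node = node[bits.pop(0)] if bits else node[0]  # pad with 0s at end
        return node

    counts = {"the": 4, "a": 2, "cat": 1, "sat": 1}  # toy n-gram follower counts
    tree = huffman_tree(counts)
    bits = [1, 0, 1]
    print(emit_token(bits, tree), "| bits left:", bits)  # a | bits left: [1]

Decoding is the mirror image: given the emitted token, read its root-to-leaf
path back off as bits.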

Stegasuras has orders-of-magnitude better output, and seems to solve the
problems I couldn't solve eight years ago. I would venture that their new
result has as much to do with advances in language modeling as it does with
the particulars of their encoding and decoding algorithms.

I'll also note that I'm glad these researchers were able to use grant money to
do this work. As a non-academic, I applied for an AI Grant to support me in
upgrading Plainsight to use deep learning, but I was turned away at the time.

Finally, one of the ideas I picked up back then is that spam can be used to
contain secret messages. Send enough gibberish to enough people, with your
intended recipient included, and you'll look like a spammer--not a spy:

       $ wget https://spamassassin.apache.org/publiccorpus/20030228_spam.tar.bz2
       $ tar -jxvf 20030228_spam.tar.bz2
       $ cat spam/0* > spam-corpus.txt
    
       $ echo "The Magic Words are Squeamish Ossifrage" | plainsight -m encipher -f spam-corpus.txt > spam_ciphertext
       
       $ cat spam_ciphertext
       (8.11.6/8.11.6) 3 (Normal) Internet can send e-mails until to transfer 26 10 [127.0.0.1]
       also include address from the most logical, mail business for your Car have a many our
       portals ESMTP Thu, 29 1.0 this letter on internet, <a style=3D"color: 0px; text/plain;
       cellspacing=3D"0" how quoted-printable about receiving you would like width=3D"15%"
       width=3D"15%" border="0" width="511" Date: Tue, 27 Thu, 19 26 because
       zzzz@localhost.spamassassin.taint.org for
       
       $ cat spam_ciphertext | plainsight -m decipher -f spam-corpus.txt
       Adding models:
       Model: spam-corpus.txt added in 2.57s (context == 2)
       input is "<stdin>", output is "<stdout>"   
       deciphering: 100% | 543.84  B/s | Time: 0:00:00
       
       The Magic Words are Squeamish Ossifrage

------
swsieber
Anybody have a cheatsheet for the acronyms used? It's pretty neat.

~~~
kcazyz
Hi, author here - GPT-2: a recent large language model (generates sentences
conditioned on previous sentences) LM: language model ^ PPL: perplexity
([https://en.wikipedia.org/wiki/Perplexity](https://en.wikipedia.org/wiki/Perplexity))
KL: Kullback–Leibler divergence
([https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diver...](https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence))
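
In case it helps intuition, perplexity is just the exponentiated average
negative log-likelihood of the observed tokens; a quick Python sketch (my own,
not from the paper):

    import math

    def perplexity(token_probs):
        """exp of the average negative log-likelihood of the tokens."""
        n = len(token_probs)
        return math.exp(-sum(math.log(p) for p in token_probs) / n)

    # Uniform guessing among 4 tokens gives perplexity 4.0.
    print(perplexity([0.25, 0.25, 0.25, 0.25]))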

~~~
debatem1
Cool demo, is the source live? I'd like to see how it works with
IPv4-over-Twitter.

Edit: nevermind, found it. For those looking, it's at
[https://github.com/harvardnlp/NeuralSteganography](https://github.com/harvardnlp/NeuralSteganography)
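
For those curious about the core mechanism before diving in, the
arithmetic-coding idea is roughly this (my own toy single-step sketch, not the
repo's actual API): the secret bitstream is read as a binary fraction in
[0, 1), and the encoder picks the token whose cumulative-probability interval
under the language model contains that point.

    def toy_lm(context):
        """Stand-in for GPT-2: a fixed toy next-token distribution."""
        return [("the", 0.5), ("a", 0.25), ("cat", 0.125), ("sat", 0.125)]

    def bits_to_fraction(bits):
        """Interpret a bit list as a binary fraction in [0, 1)."""
        return sum(b / 2 ** (i + 1) for i, b in enumerate(bits))

    def embed_one_token(bits, context):
        """Pick the token whose cumulative-probability interval
        contains the secret bits viewed as a point in [0, 1)."""
        x, cum = bits_to_fraction(bits), 0.0
        for token, p in toy_lm(context):
            if x < cum + p:
                return token
            cum += p

    # 0.625 lands in "a"'s interval [0.5, 0.75).
    print(embed_one_token([1, 0, 1], None))

The real encoder repeats this step with interval renormalization, which is why
its output distribution can match the language model's almost exactly.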

------
sorokod
What is the bound on size(stegatext) / size(plaintext)?

------
domnomnom
Scary!!!

------
ngcc_hk
Very good article. Highly recommended.

