
Be careful what you copy: Invisibly inserting usernames into text - spongysponges
https://medium.com/@umpox/be-careful-what-you-copy-invisibly-inserting-usernames-into-text-with-zero-width-characters-18b4e6f17b66
======
zbobet2012
I did this (non-publicly) many years ago for my eve online alliance. A
substantial problem exists in that forging the identity of _someone else_ is
fairly easy in a naive scheme if someone detects these characters. That means
you can sow chaos by blaming innocent folks. In practice you'll want to "sign"
the inserted data as well.

Also because of the overhead here and the fact that you will want the
signature to occur at regular intervals a better compression scheme than
0=>char1 1=>char2 is needed. Combining zero width chars and homoglyph
substitution* can produce codings which hold signed usernames in only a few
characters.
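
As a rough illustration of a denser coding than one character per bit, here is a minimal sketch that packs two bits into each invisible character. The four-character alphabet and the names `encodeHidden` and `decodeHidden` are assumptions made for this example, not the scheme described above, and it leaves out signing and homoglyph substitution entirely.

```javascript
// Assumed alphabet: four zero-width code points, so each one carries 2 bits.
// Which of these survive copy/paste depends on the applications involved.
const ALPHABET = ['\u200B', '\u200C', '\u200D', '\uFEFF'];

// Encode a string as invisible characters: UTF-8 bytes, 4 symbols per byte.
function encodeHidden(message) {
  const bytes = new TextEncoder().encode(message);
  let out = '';
  for (const byte of bytes) {
    for (let shift = 6; shift >= 0; shift -= 2) {
      out += ALPHABET[(byte >> shift) & 0b11];
    }
  }
  return out;
}

// Decode: collect alphabet characters from the text and ignore everything else.
function decodeHidden(text) {
  const digits = [];
  for (const ch of text) {
    const value = ALPHABET.indexOf(ch);
    if (value !== -1) digits.push(value);
  }
  const bytes = new Uint8Array(Math.floor(digits.length / 4));
  for (let i = 0; i < bytes.length; i++) {
    bytes[i] =
      (digits[4 * i] << 6) |
      (digits[4 * i + 1] << 4) |
      (digits[4 * i + 2] << 2) |
      digits[4 * i + 3];
  }
  return new TextDecoder().decode(bytes);
}
```

With this alphabet a 3-byte username costs 12 invisible characters instead of the 24 that a one-bit-per-character mapping would need. Decoding assumes the cover text contains none of these four characters on its own.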

There are other, far more interesting ways to watermark text than this that
are both harder (to impossible) to detect and produce better results.

*[https://www.researchgate.net/publication/308044170_Content-p...](https://www.researchgate.net/publication/308044170_Content-preserving_Text_Watermarking_through_Unicode_Homoglyph_Substitution)

P.S. It's nice to see people publish conference papers on this stuff. I always
had to hide it because we actually used it.

~~~
nzpopa
What about fingerprinting a photographed text? I'm thinking that encoding the
hidden message as bits and representing them as spaces around some arbitrary
anchor keywords from the original text might work. Extracting the message then
requires either OCR or manual work (counting spaces).

~~~
cesarb
It could also be encoded in the choice of serif or sans-serif for each
character: [http://elonka.com/friedman/](http://elonka.com/friedman/)

~~~
specialist
That would have never occurred to me. So many tricks.

I wonder if round tripping texts could be an effective sanitizer. Text to
speech and back. English to Chinese and back.

------
irrational
How difficult would it be to write a browser extension to either remove all
zero-width characters or somehow make it super obvious that they are being
used on the page?

I just searched for "zero-width" and "zero width" in Chrome and Firefox's
extensions stores, but didn't come up with anything.
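
The detection half of such an extension could be fairly small. The sketch below rests on several assumptions: the character list is not exhaustive, and `countZeroWidth` and `revealZeroWidth` are made-up names. It only works on strings; a real extension would also have to walk the page's text nodes.

```javascript
// Assumed, non-exhaustive list: ZWSP, ZWNJ, ZWJ, word joiner, BOM/ZWNBSP.
const ZERO_WIDTH = /[\u200B\u200C\u200D\u2060\uFEFF]/g;

// How many suspicious characters does the text contain?
function countZeroWidth(text) {
  return (text.match(ZERO_WIDTH) || []).length;
}

// Replace each invisible character with a visible tag such as <U+200B>.
function revealZeroWidth(text) {
  return text.replace(ZERO_WIDTH, (ch) => {
    const hex = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
    return `<U+${hex}>`;
  });
}
```

Showing the characters rather than deleting them avoids silently breaking text in languages that use them legitimately.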

~~~
jessriedel
Would probably be better to do this at the OS level, no? Just ensure that
shift-cmd-V/shift-ctrl-V strips zero-width characters in addition to
formatting. I can't think of a situation where I'd want to keep one but not
the other, and you could always do that manually if it came up.

~~~
inetknght
If it'd also optionally (default=true) strip text formatting from copy/paste
that'd be epic. No more copy-from-browser-then-paste-into-notepad-then-copy-
from-notepad-then-paste-into-email-or-chat

~~~
jessriedel
What matthberg said. I'm on a Mac and I use shift-cmd-V frequently (more often
than cmd-V).

Unfortunately, there are a few apps that use cmd-opt-shift-V instead of
cmd-shift-V. You can fix most of them using this:

[https://apple.stackexchange.com/questions/182970/paste-to-ma...](https://apple.stackexchange.com/questions/182970/paste-to-match-style-app-shortcut)

After that, almost everything will use cmd-shift-V. However, Microsoft Word is
still broken, apparently because it uses a slightly different command that
does a similar thing, "Paste and Match Formatting". Haven't found a way to fix
that yet.

~~~
inetknght
Linux.

------
etatoby
This sort of thing is one of the reasons I never liked the "noise texture"
that appeared on MacOS X and other GUIs and websites not so long ago. I always
thought my (former) OS was fingerprinting every screenshot I made. I'd love to
be proven wrong, but you are never too careful.

~~~
chatmasta
This reminds me of a few years back when the internet identified a parody
twitter account by analyzing iOS screenshots it posted.

I just tried googling for the story but I can’t remember what the account was
about. I think it was some sort of parody silicon valley account. It was a
great story, if anyone remembers and can find the link.

~~~
reubenmorais
That was Startup L. Jackson:
[https://www.quora.com/Who-is-behind-Startup-L-Jackson/answer...](https://www.quora.com/Who-is-behind-Startup-L-Jackson/answer/Nikhil-Dandekar)

------
teolandon
Also be careful of copy-pasting bash commands or install instructions to your
terminal, they can contain hidden zero-width malicious commands, as well as a
newline at the end to make the command run immediately. Ohmyzsh on my machine
detects copy-pasted text and warns you.

~~~
pasta
I thought this was also done by adding spans in the text set to display: none.
You will still copy that text without selecting it:

    
    
      cd /tmp;<span>rm -R ~/;</span>ls;

~~~
teolandon
Yes, I actually confused the two concepts and my comment is a bit misleading.
This exploit is only done using display tricks, NOT using zero-width
characters. Zero-width characters are very limited and can't actually spell
out commands (to my knowledge).

It is, though, still another reason to be careful when copy-pasting.

Now I'm wondering whether you can somehow put an ESC character in text so that
when you copy-paste it into vim, it goes to normal mode and starts performing
commands. Hmm...

~~~
jwilk
Here are PoC exploits against various editors:

[http://www.openwall.com/lists/oss-security/2018/03/05/2](http://www.openwall.com/lists/oss-security/2018/03/05/2)

Even pasting to cat(1) might be insecure. The paste can contain ^D, which will
make cat quit; then the rest of the paste will be interpreted by the shell.
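
A paste guard along these lines could flag control characters before the text reaches a shell or editor. This is only a sketch with a hypothetical `findControlChars` helper; it checks C0 control characters and DEL, and it will not catch ordinary-looking commands smuggled in through hidden spans.

```javascript
// Report every control character in the text except TAB.
// Newlines are reported too, because a trailing newline can make a
// pasted command run immediately.
function findControlChars(text) {
  const hits = [];
  for (let i = 0; i < text.length; i++) {
    const code = text.charCodeAt(i);
    const isControl = (code < 0x20 && code !== 0x09) || code === 0x7f;
    if (isControl) hits.push({ index: i, code });
  }
  return hits;
}
```

A caller could refuse the paste, or show the offending positions, whenever the returned list is non-empty. ^D is code 4 and ESC is code 27.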

------
mastazi
The first time I met zero-width characters (I suppose this was long before
they became popular for "fingerprinting" text) was in a weird bug where
some javascript would fail due to a \u200b being present in a user-entered
string (it was easily fixed by changing the method that we used to sanitise
strings). I remember thinking "wow with these zero-width characters you could
do steganography within text, even in a very short string". It looks like I
wasn't the only one who had that idea.

------
yzmtf2008
A good utility to use to combat all clipboard-related exploits is one of
Apple's example code: ClipboardViewer[0]. You can see a screenshot of it
displaying the copied code from the demo in the article here[1].

Besides being able to see the hidden characters, you can also see the internal
"layers" of the clipboard, e.g., how a rich-text sentence can be pasted into
both a plain-text editor and a rich-text one.

[0]:
[https://developer.apple.com/library/content/samplecode/Clipb...](https://developer.apple.com/library/content/samplecode/ClipboardViewer/Introduction/Intro.html)

[1]: [https://s3.andyfang.me/screenshots/clipboard-viewer.png](https://s3.andyfang.me/screenshots/clipboard-viewer.png)

~~~
eikenberry
You can also see the unprintable characters at the command prompt by piping
the data through `cat -v`.

------
askvictor
This would be an interesting approach to plagiarism detection; I could see how
it would be used for a couple of online courses that I use with my students.
Of course it's just part of the arms race, though.

~~~
Neowizard
My thoughts exactly.

------
ChuckMcM
IBM did something similar with unused high order bits in a firmware image that
Memorex was accused of copying. This is the first time I've seen zero-width
characters used; presumably you could build a brainfuck compiler that would
let you write code as zero width spaces :-) Then you could have an invisible
script inside your document. Fun but not particularly useful.

EDIT: Or a whitespace interpreter
([https://en.wikipedia.org/wiki/Whitespace_(programming_langua...](https://en.wikipedia.org/wiki/Whitespace_\(programming_language\)))

------
Semaphor
My favorite use of invisible characters was to enable spoilers in Facebook
without littering the post with garbage.

First line explains it's a spoiler and for what, hundreds of invisible
characters, actual spoiler.

That way FB would just show the first line followed by "read more"

------
fluxsauce
That's a really interesting technique!

I'm trying to think of what else could be done with the encryption /
decryption, but tracking is a really effective use case.

Could probably encode some other secret messages in there, make a blog post
about cheese include a hash to a pastebin.

It also reminds me of the importance of having strong validation around things
like usernames, because if I had a username that looked official but contained
an invisible character... Related: ICANN explicitly forbids domain names from
including zero-width space.

~~~
blackflame7000
Not much new can be done encryption wise as this falls more under the category
of steganography which is security through obscurity.

------
umpox
If anyone isn't too keen on reading the article:

Source Code: [https://github.com/umpox/zero-width-detection](https://github.com/umpox/zero-width-detection)

Demo: [https://umpox.github.io/zero-width-detection](https://umpox.github.io/zero-width-detection)

------
EspadaV9
Although the demo works, if you just copied part of the text your username
isn't taken with it. I think a better way would be to insert the zero-width
characters in between the letters and repeat it throughout the text.
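
One way to sketch that idea: append a copy of the marker after a word boundary every so often, so that most partial copies still carry at least one full copy. `spreadMarker`, the `every` parameter, and the sample marker are all hypothetical, and this is not how the linked demo works.

```javascript
// Hypothetical marker: any string of invisible characters produced elsewhere.
const SAMPLE_MARKER = '\u200B\u200C\u200B';

// Insert `marker` after a space once at least `every` visible characters
// have passed since the previous insertion.
function spreadMarker(text, marker, every = 20) {
  let out = '';
  let sinceLast = 0;
  for (const ch of text) {
    out += ch;
    sinceLast++;
    if (sinceLast >= every && ch === ' ') {
      out += marker;
      sinceLast = 0;
    }
  }
  return out;
}
```

A smaller `every` makes short excerpts more likely to contain a copy, at the cost of more invisible characters for a detector to notice.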

~~~
gfo
Was just thinking this. This would be especially useful with plagiarism
detection.

------
thedailymail
In addition to the use of Diff Checker mentioned by the author, spell-checkers
will also highlight words that are broken up by zero-width characters.

~~~
JorgeGT
Sublime Text package "Sublime Gremlins" [1] detected and highlighted the zero-
width characters:
[https://i.imgur.com/LNlcgRK.png](https://i.imgur.com/LNlcgRK.png)

---

[1]
[https://github.com/redoPop/SublimeGremlins](https://github.com/redoPop/SublimeGremlins)

~~~
Rotareti
VSCode version: [https://github.com/nhoizey/vscode-gremlins](https://github.com/nhoizey/vscode-gremlins)

------
buro9
The interesting question is how to defend against this.

If you are a journalist wishing to protect your source, what tool could be
used to process content such that the essence is left intact but the Unicode
zero-width steganography is stripped, or replaced by the common space
character?

I know enough to say that you cannot just search and replace, as many of the
zero-width characters have a meaning in different languages and produce a
visual effect when combined with other runes. Just stripping them all will
break text in those languages.

Is there a method for removing zero-width whitespace such that journalist
sources could be protected?
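
A partial answer, sketched under heavy assumptions: always drop the zero-width space, word joiner, and BOM, and keep ZWNJ/ZWJ only when they sit between two non-ASCII letters, where they are more likely to be doing real shaping work. The `sanitize` name and the rule itself are illustrative; it breaks emoji ZWJ sequences, can still leave a watermark in non-Latin text, and does nothing about homoglyphs or synonym tricks.

```javascript
// Dropped unconditionally (assumption: the text does not rely on U+200B
// for line breaking, as Thai or Khmer text sometimes does).
const ALWAYS_STRIP = new Set(['\u200B', '\u2060', '\uFEFF']);
// Kept only in a plausible shaping context.
const JOINERS = new Set(['\u200C', '\u200D']);

// True for a non-ASCII letter or combining mark.
function isNonAsciiLetter(ch) {
  return (
    ch !== undefined &&
    ch.codePointAt(0) > 0x7f &&
    /^[\p{L}\p{M}]$/u.test(ch)
  );
}

function sanitize(text) {
  const chars = Array.from(text); // iterate by code point
  const kept = [];
  for (let i = 0; i < chars.length; i++) {
    const ch = chars[i];
    if (ALWAYS_STRIP.has(ch)) continue;
    if (JOINERS.has(ch)) {
      const prev = kept[kept.length - 1];
      const next = chars[i + 1];
      // Drop joiners that are not between two non-ASCII letters.
      if (!isNonAsciiLetter(prev) || !isNonAsciiLetter(next)) continue;
    }
    kept.push(ch);
  }
  return kept.join('');
}
```

In this sketch an English sentence comes out clean, while a Persian word that needs its ZWNJ keeps it.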

~~~
chatmasta
The safest thing is to retype it. But that doesn’t cover the risk of
synonym/frequency fingerprinting discussed elsewhere in this thread.

~~~
chii
What if you also did a random synonym replacement throughout the piece to
destroy the watermarking? If the source is anonymous and hidden, then
authenticity cannot be checked by the reader anyway, and so replacement
without changing meaning is an acceptable change to protect sources.

~~~
snovv_crash
You would have to change every word, since any could be a watermarked
synonym. A better way would be to read it, make a summary, then rewrite it
from memory and only use the source data to correct factual differences.

~~~
chatmasta
You wouldn't necessarily have to change every word; just enough to break the
decoding scheme. But even then it's totally random, so the longer the
document, the more opportunities to be fingerprinted. It's like the old saying
goes, "the police only need to be lucky once, but the criminals need to be
lucky all the time." [0]

[0] I never bothered looking up where this came from until just now...
interestingly it's from the IRA, and used in the total opposite way most
people use it now...
[https://en.wikipedia.org/wiki/Brighton_hotel_bombing](https://en.wikipedia.org/wiki/Brighton_hotel_bombing)

~~~
snovv_crash
Your saying illustrates what I'm saying perfectly: if you miss even a single
fingerprinted word, it might uniquely identify you. So you need to change
every single word, and even that isn't enough in case eg. adjectives were
added or omitted.

------
irrational
Where can I find a full list of zero-width characters?

~~~
ken
It looks like there are just four:
[https://en.wikipedia.org/wiki/Zero_width](https://en.wikipedia.org/wiki/Zero_width)

~~~
dotancohen
Note that the zero-width characters are not the only characters that one is
unlikely to notice. I regularly use the RLE, LRE, and other non-printing
characters [1] in my text.

[1]
[http://dotancohen.com/howto/rtl_right_to_left.html](http://dotancohen.com/howto/rtl_right_to_left.html)

~~~
zb3
In addition to RTL/LTR mark characters, there are also TAG characters[1].

[1]
[https://en.wikipedia.org/wiki/Tags_(Unicode_block)](https://en.wikipedia.org/wiki/Tags_\(Unicode_block\))

------
et2o
What was the original purpose for these characters' design?

~~~
throw_away
U+200B: [https://en.wikipedia.org/wiki/Zero-width_space](https://en.wikipedia.org/wiki/Zero-width_space)

U+200C: [https://en.wikipedia.org/wiki/Zero-width_non-joiner](https://en.wikipedia.org/wiki/Zero-width_non-joiner)

------
newsbinator
Interestingly, pasting text with zero width chars into an iMessage chat box
does alert the user by showing "question in a box" chars for the zero-width
chars.

------
msoad
Most leaks are screenshots :D

But seriously, ZWNJ is really hard to see when you copy text. The only way to
sanitize it is to run it through a program.

------
jiggunjer
I'm surprised I can't detect them on notepad++

~~~
tinyrick2
I can confirm that Vim shows the hidden characters as a question mark.

~~~
aembleton
So does Visual Studio Code

------
userbinator
This reminds me of the times I've had to help beginning students debug very
confusing errors because one of these characters somehow found their way into
someone's source code (likely because a word processor was used to edit it at
some point.)

------
danilocesar
The so-called-better-way of doing this using Unicode substitution can be found
at
[http://smartdata.cs.unibo.it/watermark/](http://smartdata.cs.unibo.it/watermark/)

~~~
newsbinator
I like the concept. Please correct me if I'm wrong: it looks like you'd need a
lot more than "46 to 101 characters" in a message before you can apply this
method reasonably.

An MD5 hash for any decent-length password is long, and this method only
allows you to replace the subset of "confusable" latin chars in a text.

For instance: C = 0x0043 = 0x216d

When you reach one of these replaceable characters, you either replace it or
you don't, which you mark as either 1 or 0.

So for the string "password", our binary MD5 hash is
"01011111010011011100110000111011010110101010011101100101110101100001110110000011001001111101111010111000100000101100111110011001".

That's 128 possible replacements needed in the original text.

I imagine the original text would have to be at least 10-20 times that length
before we found enough "confusable" latin chars to replace.

I'm eager to hear what I'm missing, because I do like this method a lot.
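
The arithmetic can be checked with a short Node.js sketch. It converts an MD5 digest to bits and counts how many characters of a cover text are replaceable, one bit per replaceable character as described above. The `CONFUSABLE` set is a hypothetical stand-in (Latin letters that have Roman-numeral lookalikes such as U+216D), not the set the linked tool uses.

```javascript
const crypto = require('node:crypto');

// Hypothetical confusable set, for illustration only.
const CONFUSABLE = new Set('CDILMVXcdilmvx');

// MD5 digest of the input as a string of 128 binary digits.
function md5Bits(input) {
  const hex = crypto.createHash('md5').update(input).digest('hex');
  return Array.from(hex, (h) => parseInt(h, 16).toString(2).padStart(4, '0')).join('');
}

// Each replaceable character can carry one bit (replaced or not).
function capacityInBits(text) {
  let bits = 0;
  for (const ch of text) {
    if (CONFUSABLE.has(ch)) bits++;
  }
  return bits;
}

// Does the cover text have room for the whole digest?
function fits(coverText, secret) {
  return capacityInBits(coverText) >= md5Bits(secret).length;
}
```

How much cover text is needed depends entirely on how large the real confusable set is, so the 10-20x estimate above is best tested against the actual table rather than this toy one.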

------
skookumchuck
Unicode should not have invisible characters.

~~~
dotancohen
Then how would you indicate that a space or newline is intended? Or how would
I indicate that a length of text is to be displayed from right to left when I
post Hebrew text to an English-expecting text field?

~~~
skookumchuck
> Then how would you indicate that a space or newline is intended?

They aren't invisible, you can see the result as spaces and the next line.

> how would I indicate that a length of text is to be displayed from right to
> left when I post Hebrew text to an English-expecting text field?

That is also a visible effect.

~~~
megaremote
Nope, not always. Try putting a newline in HTML, it is ignored.

~~~
skookumchuck
> ignored

Not exactly, it still serves as a word separator. Like the newline (not a
space) I put between "word" and "separator".

------
russdill
The postscript mentions a unique user ID. Ideally you'd just want a hash of
the user ID, some private key, and possibly a session ID. The quick and dirty
method works great for a one off though.
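
A minimal Node.js sketch of that idea, assuming an HMAC is an acceptable way to mix in the private key. The function name, the field separator, and the 8-byte truncation are choices made for this example only.

```javascript
const crypto = require('node:crypto');

// Derive an opaque token from the user ID and session ID under a secret key.
// Only someone holding the key can compute or verify a token, which makes it
// harder to frame another user by forging their watermark.
function watermarkToken(secretKey, userId, sessionId, bytes = 8) {
  return crypto
    .createHmac('sha256', secretKey)
    .update(`${userId}\n${sessionId}`)
    .digest()
    .subarray(0, bytes)
    .toString('hex');
}
```

To attribute a leaked token, the server would recompute it for each known user and session and look for a match. Shorter tokens are cheaper to hide but collide more easily.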

------
DyslexicAtheist
this haphazard attempt at DRM seem like a perfect OpSec layer in case the
purpose is to throw off the gullible media or any "experts" overly keen on
cyber attribution. The method seems flawed unless you actually want your
adversary to find it and make incorrect assumptions, it's probably not the
right tool for the job.

EDIT/PS: the first one who finds the homoglyphs within this very status update
and posts it in the comments wins an iPhone which has still not been patched
against the జ్ఞ‌ా vulnerability ;)

~~~
knolax
in the post this apostrophe: [‘] is used instead of the standard [']
apostrophe or [`] grave. It's pretty trivial to check just by opening it in
any non-Unicode supporting terminal. In fact the zero width characters show up
explicitly as blue <200b> in vim.

People always throw around language compatibility when the unusual features of
Unicode are thrown around, but stuff like zero width characters really don't
need to be supported for language compatibility.

On another note, if Unicode is willing to butcher Chinese/Kanji orthography
with Han unification then it ought to be willing to get rid of Latin
homographs.

------
SubiculumCode
Zero width characters don't seem to have a use case for most web browser
users. It seems that they should be filtered out, and an option to display
left in about:config

edit: Others have pointed out that there ARE legit uses of zero-length chars,
especially for other languages. Still, I bet a solution does not have to be
all or none.. I bet common legitimate use-cases can be segregated from others.

------
John_KZ
Just when I thought I was aware of all major data leak pathways, something
like this comes up and it leaves me dumbfounded.

------
gwbas1c
Chrome on Android put a line break in the middle of a word. I assume that was
an "invisible" character?

------
IloveHN84
Nice for fingerprinting text and keeping it from just being copy/pasted around
the web

------
gromy
Related HN thread regarding a chrome extension to counter this:

[https://news.ycombinator.com/item?id=16754987](https://news.ycombinator.com/item?id=16754987)

------
plesn
How do I see those invisible characters in emacs or vim ? In emacs I thought
that whitespace-mode would do the trick but apparently it doesn't.

~~~
Izkata
vim (7.4) seems to display them by default. With my whole vimrc commented out
(just to be sure it wasn't a setting I changed), I get this:

    
    
        F<200b>or exam<200b>ple, I’ve ins<200b>erted 10 ze<200b>ro-width spa<200b>ces in<200b>to thi<200b>s sentence, c<200b>an you tel<200b><200b>l?
    

(The <200b> is also highlighted a different color than the rest of the text,
and acts like a single character when moving the cursor through it. It's
really, really obvious.)

~~~
plesn
Indeed, I copy-pasted in emacs but not in vim, my bad...

------
partycoder
Otherwise known as steganography. It can also be done with images, text,
sound, DNA and virtually anything that carries information.

------
codedokode

        const zeroPad = num => '00000000'.slice(String(num).length) + num;
    

What a bad way to declare a function. Starting with the word 'function' will
make the code much more readable. Arrow functions are supposed to be used as
small callbacks, not to obfuscate the meaning of the code.

~~~
hyper_reality
I was more confused by the parameter being named 'num' before having the
String function called on it and its length taken - when in fact 'num' should
be a binary string when it's passed into the function. But maybe I've been
spoiled by strong types.
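
For comparison, a sketch of the same helper as a named function declaration, with the parameter renamed to say what it holds. It swaps the slice trick for `String.prototype.padStart`, which is an editorial substitution and not the article's code.

```javascript
// Left-pad a string of binary digits with zeros to 8 characters (one byte).
// Input longer than 8 characters is returned unchanged.
function zeroPad(bits) {
  return String(bits).padStart(8, '0');
}
```

For example, `zeroPad('101')` gives `'00000101'`. Unlike the original, this version always returns a string, even when it is handed a number.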

~~~
hodl
The rightmost num is coerced. And the hole is finished on par.

------
dj-wonk
Would someone please share a quick image that shows a visual example?

------
hexo
so, we'll copy text to editor, screenshot it, OCR it and then paste? did I get
it right?

------
feelin_googley

       Zero-width characters are invisible, `non-printing' characters that are
       not displayed by the majority of applications. F*or exam*ple, I've
       ins*erted 10 ze*ro-width spa*ces in*to thi*s sentence, c*an you tel**l?
       (Hint: paste the sentence into Diff Checker to see the locations of the
       characters!). These characters can be used to `fingerprint' text for
       certain users.
    
    

Above is what the paragraph looks like in a text-only browser in VGA textmode.

~~~
joosters
That's just because the browser doesn't handle unicode well. The 'text-only',
'VGA' and 'textmode' are actually irrelevant. The behaviour you are seeing is
down to programmer choice/laziness/missing support.

~~~
userbinator
If you're only expecting ASCII text, then not being able to show anything else
could even be considered a feature to reduce attack area, since any sort of
Unicode trickery then becomes impossible.

~~~
daveid
There are more languages in the world than English; not supporting Unicode is
not an option for most.

~~~
always_good
Yes, but thread OP was essentially saying "look how those chars appear when
you can only render ASCII." And then the person you replied to said that
explicitly.

I'd wager that both of them know there are more languages than English.

