
Show HN: I made a Chrome extension to reveal zero-width characters - chpmrc
https://github.com/chpmrc/zero-width-chrome-extension
======
TheAceOfHearts
A bit tangential, here's a few unix utils you can use to inspect text:

`vis`: display non-printable characters in a visual format

`cat -e`: Display non-printing characters and display a dollar sign (`$') at
the end of each line.

`hexdump -c`: Display the input offset in hexadecimal, followed by sixteen
space-separated, three column, space-filled, characters of input data per
line.

`od -a`: Output named characters.

~~~
pavelbr
Is `hexdump` not `xxd`?

~~~
gkya
xxd is part of vim IIRC.

~~~
na85
Nope.

[https://linux.die.net/man/1/xxd](https://linux.die.net/man/1/xxd)

~~~
gkya
Yep:

    
    
      $ apt-cache show xxd | grep Source:
      Source: vim (2:8.0.1453-1)
      Source: vim

~~~
na85
Oh I must've misunderstood what you meant by "part of" vim.

------
kerkeslager
I agree with other users who said that it would be nice if this happened
automatically instead of on click--I think after the initial novelty wore off,
I'd probably forget to click it. It might be cleaner to just strip the non-
printing characters and display a number of characters stripped, similar to
how ad blockers display the number of ads blocked.
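
A minimal sketch of the strip-and-count idea, assuming a small hand-picked set of invisible code points (the thread doesn't settle on an exact list, and `strip_zero_width` is a made-up name):

```python
import re

# Assumed list of "invisible" code points: ZWSP, ZWNJ, ZWJ, WORD JOINER, BOM/ZWNBSP.
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")

def strip_zero_width(text):
    """Return (cleaned_text, number_of_characters_removed)."""
    cleaned, removed = ZERO_WIDTH.subn("", text)
    return cleaned, removed

cleaned, removed = strip_zero_width("pass\u200bword\u200c")
print(cleaned, removed)  # password 2
```

The count is what a badge on the extension icon could show. Note that blindly removing U+200D would also break emoji sequences and some non-English text.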

~~~
chpmrc
I like the adblock analogy, might work on it, thanks.

~~~
Torn
Please do, and publish as a real extension - I'd definitely use it

------
Someone1234
It might make more sense as a clipboard filter.

If you're an English speaker that doesn't often handle languages which gain
from zero width characters you could just have a listener scan your clipboard
for zero-width characters, silently strip them, and then re-populate the
clipboard.

That said, I've been thinking about building a clipboard filter for Windows a
lot lately. I'm getting tired of copying text to the address bar or Notepad to
strip text formatting information.

~~~
Alex3917
Zero width characters can also be used to force long strings of text to wrap
correctly, so this may break the layout of sites that allow things like URLs
in UGC text.

~~~
Someone1234
Why would a website flow through a user's clipboard?

~~~
teolandon
They're talking about OP's extension that modifies how the website is
presented. Not the clipboard idea.

------
ballenf
It's starting to seem like the universe has some fundamental order for things
that we can escape temporarily, but that are inescapable in the long run.

1\. Photo and video evidence was a game-changer for establishing facts and
chronologies. As we've seen, it's becoming harder and harder to distinguish
fake photos and videos from real. It's not a stretch to predict a time when
we'll only be able to make probabilistic statements about the veracity of
photos or videos. (E.g., "60% likelihood of being undoctored.") This is
probably already true about photos, although the expense of making a perfect
fake is still pretty high, in terms of expertise.

I'd argue the "natural state" is one where word-of-mouth and first-hand
accounts are the most authoritative evidence we can have (other than physical
evidence like DNA left behind). And even physical evidence left behind can't
tell us what the person did while there or how an event transpired.

2\. Tracking communications. In the digital age, some people have come
to assume that all digital text is untrackable and anonymous. Your "11001110"
is the same as mine. Historically, it was pretty difficult to transcribe
information without leaving traces of the origin of that info. These zero-
width characters plus all the other text fingerprinting methods, and
ubiquitous tracking in communication logs make it nearly impossible, again, to
communicate with others without leaving a trail. And then there's the writing
style analysis which makes it tough to write anything without leaving telltale
fingerprints.

So, I'm proposing that we are returning to the "natural state" of things.
Probably overstating things a bit, but still an interesting thought to
consider.

~~~
dctoedt
> _... the "natural state" is one where word-of-mouth and first-hand accounts
> are the most authoritative evidence we can have (other than physical
> evidence like DNA left behind)_

First-hand accounts aren't always reliable.

[https://en.wikipedia.org/wiki/Eyewitness_testimony](https://en.wikipedia.org/wiki/Eyewitness_testimony)

~~~
heartbreak
Neither is DNA for that matter.

[https://www.theatlantic.com/magazine/archive/2016/06/a-reaso...](https://www.theatlantic.com/magazine/archive/2016/06/a-reasonable-doubt/480747/)

~~~
goldenkey
DNA is pretty ironclad. What that article shows is that people in
authoritative positions are fallible.

------
kragen
Have you thought about revealing things like non-breaking space, thin space,
hair space, and so on? I think at least Fecebutt replaces non-breaking spaces
with regular spaces in comments, perhaps precisely to frustrate such
steganography. (Ironically, last I recall, they do preserve the difference
between double and single spaces, e.g. after periods, which is even more
invisible in HTML.)

I second the recommendations of vis, cat -e (or cat -vte), and od -a. I didn't
know about hexdump; I always used xxd.

------
boramalper
The project is clearly motivated by _Be careful what you copy: Invisibly
inserting usernames into text_ [0] (posted 12 hours ago on HN), and many
people here might think zero-width characters (or rather, any esoteric Unicode
characters) are the only (or the prominent) way to watermark texts, which is
wrong. The whole topic - _watermarking_ \- is worth at the very least a
lengthy blog post, and maybe even several articles, so keep in mind that this
comment is just a very brief introduction to different methods (of
watermarking)[x]:

(from trivial to more complex ones)

\----

1\. Using Invisible Characters

The linked[0] article basically talks about a very basic version of it, which
should be enough to get the idea. Of course, more sophisticated techniques
will use more than just two characters, and will take the _position_ of each
invisible character into consideration while encoding & decoding watermarks,
ensure it's uniformly distributed throughout the paragraphs etc.

Can be defeated by simply removing invisible characters.
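
A rough sketch of the basic two-character version, to make the idea concrete. The choice of U+200B for 0 and U+200C for 1, the placement after the first visible character, and the names `embed`/`extract` are all my own assumptions, not what the linked article does:

```python
ZERO = "\u200b"  # ZERO WIDTH SPACE stands for bit 0 (arbitrary choice)
ONE = "\u200c"   # ZERO WIDTH NON-JOINER stands for bit 1 (arbitrary choice)

def embed(cover, secret):
    """Hide `secret` as invisible characters after the first character of `cover`."""
    bits = "".join(f"{byte:08b}" for byte in secret.encode("utf-8"))
    mark = "".join(ONE if bit == "1" else ZERO for bit in bits)
    return cover[:1] + mark + cover[1:]

def extract(text):
    """Read back whatever was hidden with embed()."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    usable = len(bits) - len(bits) % 8  # ignore a trailing partial byte
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))
    return data.decode("utf-8", errors="replace")

marked = embed("Hello world", "bob")
print(extract(marked))  # bob
```

The marked string renders the same as the cover text in most fonts but is 24 characters longer, which is also the easiest way to notice it.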

\----

2\. Using Unicode Characters That Look Alike

The same working mechanism as _IDN homograph attack_ [1].

Can be defeated by simply removing "out-of-place" letters/characters, after
determining the language of a given text/paragraph/sentence etc.
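
A crude sketch of spotting "out-of-place" letters, assuming the text is supposed to be Latin-script. Python's standard library doesn't expose the Unicode script property, so this leans on the prefix of the character name as a heuristic (`out_of_place` is a made-up helper):

```python
import unicodedata

def out_of_place(text, expected_script="LATIN"):
    """Return (index, char, name) for letters whose Unicode name
    does not start with the expected script name."""
    hits = []
    for i, ch in enumerate(text):
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            if not name.startswith(expected_script):
                hits.append((i, ch, name))
    return hits

# The second letter is CYRILLIC SMALL LETTER A (U+0430), not Latin "a".
print(out_of_place("p\u0430ypal"))
```

Replacing the hits would need a look-alike table; this only finds them.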

\----

3\. Using Unicode Equivalence

> Code point sequences that are defined as canonically equivalent are assumed
> to have the same appearance and meaning when printed or displayed. For
> example, the code point U+006E (the Latin lowercase "n") followed by U+0303
> (the combining tilde "◌̃") is defined by Unicode to be canonically
> equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the
> Spanish alphabet).

[https://en.wikipedia.org/wiki/Unicode_equivalence](https://en.wikipedia.org/wiki/Unicode_equivalence)

Can be defeated by simply normalising the text, and line endings!
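
For example, with Python's `unicodedata` (NFC is one reasonable choice of normal form; `canonical` is just an illustrative helper name):

```python
import unicodedata

decomposed = "man\u0303ana"  # "n" followed by U+0303 COMBINING TILDE
composed = "ma\u00f1ana"     # single code point U+00F1

# The two strings usually look the same on screen but differ as code points.
assert decomposed != composed
assert unicodedata.normalize("NFC", decomposed) == composed

def canonical(text):
    """Normalise to NFC and convert CRLF / CR line endings to LF."""
    text = unicodedata.normalize("NFC", text)
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

After `canonical()`, any fingerprint that relied on composed vs. decomposed forms or on mixed line endings is gone.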

\----

(Now it's getting harder!)

4\. Changing the Layout of Documents

You can change (the rendering of):

    
    
      (a) the margins
      (b) the ligatures
      (c) the space between
        (i)  specific characters [kerning]
        (ii) consecutive words/lines/paragraphs
    

to embed fingerprints. This is especially dangerous as documents are often
leaked as screenshots or photocopies, which are immune to the Unicode tricks
above but not to these.

Can be defeated by copying the plaintext, and pasting it to a text editor, and
applying steps 1, 2, 3.

Also, bear in mind that you can still unintentionally leak information:

(a) when you use a document editor (LibreOffice Writer, Microsoft Word) as the
layout engines might act differently depending on your software version,
platform, file format etc.

(b) paper size (A4, US Letter, ...)

\----

5\. Substituting with Synonyms

If characters can be replaced by their equivalents, why not replace words or
sentences even? Words can be substituted by their synonyms based on an
algorithm that can create fingerprints accordingly, and I presume even
sentences can be rephrased with the latest advancements in AI/ML.

Also, some of the substitutions will be _unintentional_ : a scribble on a
piece of leaked document can also leak information about its leaker (different
spellings in American and British English, ways of writing date & time,
decimal separators etc).

Can be defeated by paraphrasing.

\----

The list can probably be extended even further but this was all I could
remember off the top of my head. =)

[0]:
[https://news.ycombinator.com/item?id=16749422](https://news.ycombinator.com/item?id=16749422)

[1]:
[https://en.wikipedia.org/wiki/IDN_homograph_attack](https://en.wikipedia.org/wiki/IDN_homograph_attack)

[x]: Not that I'm working in a related field, but I researched watermarking &
fingerprinting techniques for a similar but much more extensive project for
journalists/whistle-blowers to detect fingerprinting/watermarking in
documents.

Let me know if you are interested and we can collaborate!

~~~
zuminator
Excellent list. For completion's sake, I'd include intentional typos as an
instance of category 5. This can be hard to catch if the typo is in a name. A
logical extension of that would be entirely made-up names for non-essential
people/places.

 _At the age of 5, Hawking visited Oxford, incidentally passing through
Fakenamingham-on-Watermarkshire._

Dictionaries/encyclopedias have been known to insert entirely fake entries as
a way of proving ownership
([http://articles.chicagotribune.com/2005-09-21/features/05092...](http://articles.chicagotribune.com/2005-09-21/features/0509200275_1_electronic-dictionary-new-oxford-american-dictionary-new-yorker)).
In the age of ebooks and print-on-demand, those could be tailored to the
individual licensee.

~~~
jaclaz
>Dictionaries/encyclopedias have been known to insert entirely fake entries as
a way of proving ownership.

Just like non-existing places/roads on maps, so-called "trap-streets":

[https://en.wikipedia.org/wiki/Trap_street](https://en.wikipedia.org/wiki/Trap_street)

and similar:

[https://en.wikipedia.org/wiki/Agloe,_New_York](https://en.wikipedia.org/wiki/Agloe,_New_York)

~~~
beautifulfreak
also mountweazels:
[https://www.thoughtco.com/mountweazel-words-term-1691330](https://www.thoughtco.com/mountweazel-words-term-1691330)

------
jacquesc
Anything like this for VSCode? Just adding option-space accidentally in a ruby
file (after an `end`) will cause ruby to explode and not tell you anything.
Randomly deleting code until it runs was my only fix.

~~~
iyrkki_odyss
There is an extension called Highlight Bad Chars. I installed it after reading
this

~~~
jacquesc
Brilliant, thanks!

------
EamonnMR
I'll be sure to add a link to [http://zerowidth.space](http://zerowidth.space)

~~~
laumars
It feels like you're trying to make a point on that site but I cannot fathom
out what that point was. Are you saying zero width spaces are bad? Are you
educating people that they exist? It's not all too clear what your point is.

Also I disagree with the following:

> _If you 're working on a website there's a good chance that someone will
> want to copy/paste it, automate using it, etc. Zero-width spaces will create
> no end of hassle to those people._

I could work around that with literally just 1 line of code:

    
    
        newString = replace(originalString, "&#8203;", " ")
    

But these days many HTML parsers will tidy up the output for you so depending
on the frameworks you're using, you might not even need to use the above line
of code.

~~~
EamonnMR
> _Zero-width spaces will create no end of hassle to those people._

It's up to you to decide what to do with that information.

> newString = replace(originalString, "&#8203;", " ")

If I could have registered "zerowidth.character" I would have. You've got to
handle different characters and such.

I should say that the site is mainly a joke because we had to use a particular
tool which had a bunch of zero width spaces in it, and it got on our nerves.

~~~
laumars
> If I could have registered "zerowidth.character" I would have. _You 've got
> to handle different characters and such._

Ahhh I didn't get that from the site. I thought it was focused on one specific
type of zero width character

> I should say that the site is mainly a joke because we had to use a
> particular tool which had a bunch of zero width spaces in it, and it got on
> our nerves.

That's fair enough. :)

------
lainga
OP, does this mess up ZWJ emoji (like any of the skin-tone-modified emoji)?

~~~
chpmrc
Sorry I'm afraid I don't understand the question.

~~~
lainga
It's a pretty obscure edge case. You could retain ZWJs when they were between
two Unicode points with the Symbol or Emoji Symbol classes, but on further
reflection, it would probably be more honest to just strip the ZWJs everywhere
and decompose the emojis.

EDIT: also what jedanbik said
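
A sketch of the "retain ZWJs between symbols" variant, under my own assumption that "symbol" means any Unicode general category starting with `S` (plus the emoji variation selector U+FE0F, which often sits right before the ZWJ). `strip_stray_zwj` is a hypothetical name:

```python
import unicodedata

ZWJ = "\u200d"
VS16 = "\ufe0f"  # VARIATION SELECTOR-16

def _symbolish(ch):
    # Assumption: emoji and emoji modifiers fall in the S* general categories.
    return ch == VS16 or unicodedata.category(ch).startswith("S")

def strip_stray_zwj(text):
    """Drop U+200D unless both neighbours look like symbols/emoji."""
    out = []
    for i, ch in enumerate(text):
        if ch == ZWJ:
            prev_ok = i > 0 and _symbolish(text[i - 1])
            next_ok = i + 1 < len(text) and _symbolish(text[i + 1])
            if not (prev_ok and next_ok):
                continue  # stray joiner, e.g. between two letters
        out.append(ch)
    return "".join(out)
```

This keeps sequences like MAN + ZWJ + WOMAN + ZWJ + GIRL intact, but it would also strip ZWJs that are legitimate in scripts such as Arabic or Devanagari, so it is only reasonable for text you expect to be English.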

~~~
ascorbic
Hardly obscure. ZWJ emoji sequences are really common now.

------
username223
Does anyone remember how many years of pain were caused by different line
endings -- "\n" vs. "\r" vs. "\r\n"? Unicode is 10x that, and the pain is just
beginning.

~~~
yuchi
Did you mean ^10?

~~~
username223
Probably, since it's an omnishambles on a scale I won't pretend to understand.
Eight-fingered and two-thumbed humans can't type it, but they've managed to
make it a joke (see zalgo text), and it only gets more ridiculous with time
(see how flags are done, or emoji skin-tone modifiers). It's only a matter of
time before Metafont will be easier to read and write than the hieroglyphics
we're supposed to call "text."

------
sam1902
I made two Python scripts to encode and decode strings into and from
zero-width characters:
Encoder: [https://pastebin.com/DGansW69](https://pastebin.com/DGansW69)
Decoder: [https://pastebin.com/ZVdvjnZc](https://pastebin.com/ZVdvjnZc)

I know it's pretty basic stuff but it does the job. The encoder outputs both
to stdout and to a file so you can copy/paste it more easily (from Sublime
Text for example)

------
IloveHN84
What about Firefox?

~~~
chanz
You can use this addon:
[https://addons.mozilla.org/de/firefox/addon/foxreplace/?src=...](https://addons.mozilla.org/de/firefox/addon/foxreplace/?src=search)

Apply an autoload rule with these settings:

    
    
      Name: zero length replacement
      HTML: Output only
      URL: <entering no url matches all sites>
      Replace: (\uFEFF|\u200B|\u200C)
      Type: Regular expression
      With: <span style="background-color:#F00 !important">&#x1f633;</span>
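
Outside the browser, roughly the same rule can be sketched in Python. This uses the same three code points as the rule above, but swaps the red emoji for a plain `<U+XXXX>` marker (my choice, as is the name `reveal`):

```python
import re

# Same three code points as the FoxReplace rule: U+FEFF, U+200B, U+200C.
PATTERN = re.compile("[\ufeff\u200b\u200c]")

def reveal(text):
    """Replace each matched invisible character with a visible <U+XXXX> marker."""
    return PATTERN.sub(lambda m: f"<U+{ord(m.group()):04X}>", text)

print(reveal("a\u200bb\ufeff"))  # a<U+200B>b<U+FEFF>
```

Like the rule above, this misses other invisible characters such as U+200D or U+2060 unless you add them to the pattern.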

~~~
ric2b
Thank you!

------
flashman
Time to read the web exclusively in ASCII with UTF-8 characters inserted using
their codepoint

~WHITE SMILING FACE~ ~THUMBS UP SIGN~

------
gormz
lol you didn't even give anyone a chance. I saw someone asking about this in
the comments from the article this morning.

~~~
chpmrc
I was really intrigued by this zero-width characters thing haha

------
gervase
It would be a nice enhancement to do so automatically (rather than on-click),
but at least for me, this serves the purpose.

~~~
kawsper
Yes I agree, or maybe just update the app-icon if it detects presence of zero-
width characters, and then you can click to show them.

------
SuperGoodJared
I wonder if some sort of clipboard tool to detect and warn of sneaky looking
things might be even more convenient.

------
kbd
If it's a button to click anyway, couldn't this just be a bookmarklet?

~~~
chpmrc
Yep, but it was also a way to learn how to build and publish a Chrome
extension.

------
beautifulfreak
I wonder if this could be used on Tor to get a user ID.

------
tenryuu
So does it show the ZWJ within emoji combinations?

------
srebalaji
firefox ?

~~~
chanz
You can use this addon:
[https://addons.mozilla.org/de/firefox/addon/foxreplace/?src=...](https://addons.mozilla.org/de/firefox/addon/foxreplace/?src=search)

Apply an autoload rule with these settings:

    
    
      Name: zero length replacement
      HTML: Output only
      URL: <entering no url matches all sites>
      Replace: (\uFEFF|\u200B|\u200C)
      Type: Regular expression
      With: <span style="background-color:#F00 !important">&#x1f633;</span>

