Show HN: I made a Chrome extension to reveal zero-width characters (github.com)
385 points by chpmrc 3 months ago | hide | past | web | favorite | 96 comments

A bit tangential, but here are a few unix utils you can use to inspect text:

`vis`: display non-printable characters in a visual format

`cat -e`: Display non-printing characters and display a dollar sign (`$') at the end of each line.

`hexdump -c`: Display the input offset in hexadecimal, followed by sixteen space-separated, three column, space-filled, characters of input data per line.

`od -a`: Output named characters.
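For comparison, here's a rough Python sketch of the same idea as `cat -e`: mark line ends with `$`, show tabs as `^I`, and fall back to escaped code points for anything non-printable (which, unlike `cat`, also catches zero-width characters):

```python
# Minimal Python analogue of `cat -e`-style revealing of
# non-printing characters.
def reveal(text: str) -> str:
    out = []
    for ch in text:
        if ch == "\n":
            out.append("$\n")          # mark end of line like `cat -e`
        elif ch == "\t":
            out.append("^I")           # tabs, as `cat -et` shows them
        elif not ch.isprintable():
            # fall back to the escaped code point, e.g. \u200b
            out.append(ch.encode("unicode_escape").decode("ascii"))
        else:
            out.append(ch)
    return "".join(out)

print(reveal("foo\u200bbar\tbaz"))  # foo\u200bbar^Ibaz
```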

Further tangent: a colleague and I talked about creating a derivative of one of the programmer fonts, replacing smart/curly quotes and other such nuisances with unmistakable indicators.

This would be a useful feature for any programming font. I can totally imagine exploits written through zero-width characters; an attacker could introduce a security hole by smartly inserting them into interpreted strings (scanners/parsers). Input can come from many places, of course, but this would at least make one vector visible.

You could probably just invert any non-ASCII character, and you could probably do this programmatically over an existing font.

> `vis`: display non-printable characters in a visual format

Or `cat -v`, the namesake of http://cat-v.org/. :P

what is cat-v.org? (Obviously I visited that link, read the about page - I'm asking for your summary. it's still not clear to me after my initial visit.)

I'm sure there are better people to answer this question, but cat-v.org is the personal blog & collection of interesting stuff by former plan9 personality uriel (sadly no longer with us). It hosts random thoughts and lots of interesting stuff (including man pages) about plan9 and early unix systems, and miscellaneous other unix trivia.

It really is a treasure. Take the time to browse, it's well worth it if you're somewhat interested in unix history and not overly familiar with the subject already.

Thank you, I found your summary helpful and yes I will look through it.

Nostalgic Unix nerds hating on everything that (i) was not in the original Unix, (ii) is not in Plan9, or (iii) was not implemented by the authors of these.

Cynical and disingenuous.

Bitterly invidious Lisp nerd?

(They give off rays or something ... took a cursory look to see if I was being unfair, and booom - https://www.gkayaalp.com/blog/20170206_emacs-argument.html )

Yeah that's me :) At least the second part, because with Emacs I don't really need to be invidious.

TBH I do actually like Unix and Plan9, but having read a big part of cat-v website in the past, most of the views there are fanatical, and are more bitter, cynical and disingenuous than I can ever force myself to be. Especially the way incredibly useful and important OSS projects are denigrated and misrepresented is really annoying (see the harmful stuff section, for example).

Well, I apologize and withdraw the 'bitterly invidious', you aren't.

One part is a matter of taste, and having been around both styles, I like better the 'keep it simple' approach. Too many functions/options overload my simple brain ;-)

Do you have any useful commands like these that can handle multi-byte characters in UTF-8? For instance, handling the zero-width space U+200B, which in UTF-8 takes up more than one byte. I've got some custom scripts that do, I was wondering if there was already something out there.
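I don't know of a standard tool for this, but as a sketch of what such a script might look like: Python 3 strings are sequences of code points, so a few lines can report both the character and the UTF-8 byte offset of each zero-width character (the character class here is a common subset, not an exhaustive list):

```python
import re

# Report zero-width characters with both character and UTF-8 byte
# offsets (U+200B takes three bytes when encoded as UTF-8).
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")

def find_zero_width(text: str):
    hits = []
    for m in ZERO_WIDTH.finditer(text):
        byte_offset = len(text[:m.start()].encode("utf-8"))
        hits.append((m.start(), byte_offset, f"U+{ord(m.group()):04X}"))
    return hits

print(find_zero_width("ab\u200bcd"))  # [(2, 2, 'U+200B')]
```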

Ooh! I adore `od`, and know too many options for it. But I didn't know about the others. Thanks!

I'll also add my favorite companion to `od` and hexdumps: `man ascii`.

P.S. May we please have a basic Markdown parser for HN?

Is `hexdump` not `xxd`?

`hexdump` is part of util-linux, `xxd` is not. They do the same thing though, up to formatting differences.

And on Debian it's called `hd`.

xxd is part of vim IIRC.


  $ apt-cache show xxd | grep Source:
  Source: vim (2:8.0.1453-1)
  Source: vim

Oh I must've misunderstood what you meant by "part of" vim.

and under emacs whitespace-mode (just in case)

I agree with other users who said that it would be nice if this happened automatically instead of on click - I think after the initial novelty wore off, I'd probably forget to click it. It might be cleaner to just strip the non-printing characters and display a number of characters stripped, similar to how ad blockers display the number of ads blocked.

I like the adblock analogy, might work on it, thanks.

Please do, and publish as a real extension - I'd definitely use it

How would you analyse content that's asynchronously loaded and inserted into the DOM?

I think DOM mutation observers would be a huge hit on performance.

I'd rather my browser did the right thing slowly than the wrong thing fast.

I think this extension has a very specific use case, namely that you want to leak confidential information and don't want it traced back to you.

As others have mentioned, I could see this being used regularly by someone who plays Eve

But that would be a performance killer or what?

I'd rather my browser did the right thing slowly than the wrong thing fast.

It might make more sense as a clipboard filter.

If you're an English speaker who doesn't often handle languages that benefit from zero-width characters, you could just have a listener scan your clipboard for zero-width characters, silently strip them, and then re-populate the clipboard.
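The filtering step itself is tiny. A sketch (the clipboard hookup is platform-specific and omitted here, and the character set is illustrative rather than exhaustive):

```python
import re

# Zero-width characters commonly used for text watermarking:
# ZWSP, ZWNJ, ZWJ, and the BOM/zero-width no-break space.
ZW = re.compile("[\u200b\u200c\u200d\ufeff]")

def clean_clipboard_text(text: str) -> str:
    """Silently strip zero-width characters from copied text."""
    return ZW.sub("", text)

print(clean_clipboard_text("pay\u200bload"))  # payload
```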

I've actually been thinking about building a clipboard filter for Windows a lot lately; I'm getting tired of copying text to the address bar or Notepad to strip text formatting information.

Zero width characters can also be used to force long strings of text to wrap correctly, so this may break the layout of sites that allow things like URLs in UGC text.

In my experience, adding a zero width character into something that someone might copy/paste is generally a bad idea because it'll lead to some very irritated users having to manually remove them from copied data.

Why would a website flow through a user's clipboard?

they're talking about OP's extension that modifies how the website is presented. Not the clipboard idea.

Ctrl+shift+v pastes without formatting

Not on all apps though. Office apps do not respect that shortcut convention and require clicking on the "Paste Without Formatting" button.

In Office, you can hit Ctrl after pasting, then "t" to select paste without formatting, which is nicer than switching from keyboard to mouse. The obnoxious thing is that if the source text did not actually have formatting, then no menu opens on Ctrl, and "t" is instead inserted verbatim. Wreaks havoc with my muscle memory, in that I need to keep track of where I copied text from when pasting.

As much as that would be a helpful shortcut to remember, I know I won't. As a result, I just changed the default paste settings in Office.

That just removes formatting. zero-width characters are not formatting so they wouldn't be removed. They are as real as the letter A - you wouldn't want ctrl+shift+v to remove all A's from your text, would you?

The extension itself could just strip the characters during the copy action only perhaps

There is the puretext utility that strips formatting. Can't remember if it's open source so that you could add stripping zw chars.

It's starting to seem like the universe has some fundamental order for things that we can escape temporarily, but that are inescapable in the long run.

1. Photo and video evidence was a game-changer for establishing facts and chronologies. As we've seen, it's becoming harder and harder to distinguish fake photos and videos from real ones. It's not a stretch to predict a time when we'll only be able to make probabilistic statements about the veracity of photos or videos. (E.g., "60% likelihood of being undoctored.") This is probably already true of photos, although the expense of making a perfect fake is still pretty high, in terms of expertise.

I'd argue the "natural state" is one where word-of-mouth and first-hand accounts are the most authoritative evidence we can have (other than physical evidence like DNA left behind). And even physical evidence left behind can't tell us what the person did while there or how an event transpired.

2. Tracking communications. In the digital age, some people have come to assume that all digital text is untrackable and anonymous. Your "11001110" is the same as mine. Historically, it was pretty difficult to transcribe information without leaving traces of the origin of that info. These zero-width characters, plus all the other text fingerprinting methods and ubiquitous tracking in communication logs, make it nearly impossible, again, to communicate with others without leaving a trail. And then there's writing-style analysis, which makes it tough to write anything without leaving telltale fingerprints.

So, I'm proposing that we are returning to the "natural state" of things. Probably overstating things a bit, but still an interesting thought to consider.

First-hand accounts are quite literally one of the least reliable sources of truth regarding many kinds of events. Brains fill in a lot of gaps with heuristics.

> ... the "natural state" is one where word-of-mouth and first-hand accounts are the most authoritative evidence we can have (other than physical evidence like DNA left behind)

First-hand accounts aren't always reliable.


DNA is pretty ironclad. What that article shows is that people in authoritative positions are fallible.

I think they're proposing more that these accounts are authentic, not necessarily reliable. It's common sense that eyewitness testimony isn't 100% reliable.

Photographers have been begging the big camera makers (Nikon/Canon) to add cryptographic signatures to photos from their cameras for years, but so far they've resisted doing so.

This is strange, considering the competition going on in this industry.

Ca/Nikon’s main targets are people who shoot fast, shoot a lot, and publish very quickly (e.g. sports photographers), and people who will spend an awful lot of time post-processing (wedding, art, commercial shoots).

I’m not sure there’s enough benefit to justify spending processing time, effort and resources on cryptography for any of these fronts.

Now smartphone apps taking “secure” pictures have been there for a while, for use cases like crash site photo for insurance claims for instance. They do fill the niche I think.

Like others pointed out, the "natural state" is even further from ideal. I'm not an expert on cryptography at all, but I think signatures would help in certain situations (and /u/jonahhorowitz brings this up above me). More advanced media - the next step beyond videos, maybe involving 3D or VR - would make fabrication harder. I'm sure there are solutions out there.

The project is clearly motivated by Be careful what you copy: Invisibly inserting usernames into text[0] (posted 12 hours ago on HN). Many people here might think zero-width characters (or rather, any esoteric Unicode characters) are the only (or the most prominent) way to watermark texts, which is wrong. The whole topic of watermarking is worth at the very least a lengthy blog post, and maybe even several articles, so keep in mind that this comment is just a very brief introduction to different watermarking methods[x]:

(from trivial to more complex ones)


1. Using Invisible Characters

The linked[0] article basically describes a very basic version of it, which should be enough to get the idea. Of course, more sophisticated techniques will use more than just two characters, take the position of each invisible character into consideration while encoding & decoding watermarks, ensure they're uniformly distributed throughout the paragraphs, etc.

Can be defeated by simply removing invisible characters.
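One way to sketch that defence in Python is to drop everything in the Unicode "Cf" (format) category, which covers ZWSP, ZWNJ, ZWJ and the BOM/ZWNBSP. Note this is a blunt instrument: Cf also contains characters some scripts legitimately need.

```python
import unicodedata

# Remove every character in the Unicode "Cf" (format) category,
# which is where the usual invisible watermark characters live.
def strip_invisible(text: str) -> str:
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

print(strip_invisible("wat\u200ber\u200dmark"))  # watermark
```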


2. Using Unicode Characters That Look Alike

The same working mechanism as IDN homograph attack[1].

Can be defeated by simply removing "out-of-place" letters/characters, after determining the language of a given text/paragraph/sentence etc.
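A crude sketch of that defence, assuming the expected script is Latin (a real detector would want proper language/script detection; this just checks Unicode character names):

```python
import unicodedata

# Flag letters whose Unicode name says they are not Latin script,
# e.g. the Cyrillic 'a' (U+0430) that homograph attacks rely on.
def find_homoglyphs(text: str):
    suspects = []
    for i, ch in enumerate(text):
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            suspects.append((i, ch, f"U+{ord(ch):04X}"))
    return suspects

print(find_homoglyphs("p\u0430ypal"))  # flags position 1, U+0430
```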


3. Using Unicode Equivalence

> Code point sequences that are defined as canonically equivalent are assumed to have the same appearance and meaning when printed or displayed. For example, the code point U+006E (the Latin lowercase "n") followed by U+0303 (the combining tilde "◌̃") is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter "ñ" of the Spanish alphabet).


Can be defeated by simply normalising the text (and the line endings!).
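In Python, that normalisation is a single standard-library call:

```python
import unicodedata

# Canonical normalisation collapses equivalent sequences: NFC turns
# 'n' + combining tilde (U+0303) into the single code point U+00F1.
decomposed = "n\u0303"
composed = unicodedata.normalize("NFC", decomposed)
print(len(decomposed), len(composed))  # 2 1
```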


(Now it's getting harder!)

4. Changing the Layout of Documents

You can change (the rendering of):

  (a) the margins
  (b) the ligatures
  (c) the space between
    (i)  specific characters [kerning]
    (ii) consecutive words/lines/paragraphs
to embed fingerprints. This is especially dangerous, as documents are often leaked by taking screenshots or photocopies, which is secure against Unicode attacks, but not against these.

Can be defeated by copying the plaintext, and pasting it to a text editor, and applying steps 1, 2, 3.

Also, bear in mind that you can still unintentionally leak information:

(a) when you use a document editor (LibreOffice Writer, Microsoft Word) as the layout engines might act differently depending on your software version, platform, file format etc.

(b) paper size (A4, US Letter, ...)


5. Substituting with Synonyms

If characters can be replaced by their equivalents, why not replace words or sentences even? Words can be substituted by their synonyms based on an algorithm that can create fingerprints accordingly, and I presume even sentences can be rephrased with the latest advancements in AI/ML.

Also, some of the substitutions will be unintentional: a scribble on a piece of leaked document can also leak information about its leaker (different spellings in American and British English, ways of writing date & time, decimal separators etc).

Can be defeated by paraphrasing.


The list can probably be extended even further, but this was all I could remember off the top of my head. =)

[0]: https://news.ycombinator.com/item?id=16749422

[1]: https://en.wikipedia.org/wiki/IDN_homograph_attack

[x]: Not that I'm working in a related field, but I researched watermarking & fingerprinting techniques for a similar but much more extensive project for journalists/whistle-blowers to detect fingerprinting/watermarking in documents.

Let me know if you are interested and we can collaborate!

Excellent list. For completeness' sake, I'd include intentional typos as an instance of category 5. This can be hard to catch if the typo is in a name. A logical extension of that would be entirely made-up names for non-essential people/places.

At the age of 5, Hawking visited Oxford, incidentally passing through Fakenamingham-on-Watermarkshire.

Dictionaries/encyclopedias have been known to insert entirely fake entries as a way of proving ownership.(http://articles.chicagotribune.com/2005-09-21/features/05092...) In the age of ebooks and print-on-demand, those could be tailored to the individual licensee.

>Dictionaries/encyclopedias have been known to insert entirely fake entries as a way of proving ownership.

Just like non-existent places/roads on maps, so-called "trap streets", and similar.


Anything like this for VSCode? Just adding option-space accidentally in a ruby file (after an `end`) will cause ruby to explode and not tell you anything. Randomly deleting code until it runs was my only fix.

There is an extension called Highlight Bad Chars. I installed it after reading this

Brilliant, thanks!

Have you thought about revealing things like non-breaking space, thin space, hair space, and so on? I think at least Fecebutt replaces non-breaking spaces with regular spaces in comments, perhaps precisely to frustrate such steganography. (Ironically, last I recall, they do preserve the difference between double and single spaces, e.g. after periods, which is even more invisible in HTML.)

I second the recommendations of vis, cat -e (or cat -vte), and od -a. I didn't know about hexdump; I always used xxd.

I'll be sure to add a link to http://zerowidth.space

It feels like you're trying to make a point on that site, but I cannot fathom what that point is. Are you saying zero-width spaces are bad? Are you educating people that they exist? It's not at all clear what your point is.

Also I disagree with the following:

> If you're working on a website there's a good chance that someone will want to copy/paste it, automate using it, etc. Zero-width spaces will create no end of hassle to those people.

I could work around that with literally just 1 line of code:

    newString = replace(originalString, "​", " ")
But these days many HTML parsers will tidy up the output for you so depending on the frameworks you're using, you might not even need to use the above line of code.

> Zero-width spaces will create no end of hassle to those people.

It's up to you to decide what to do with that information.

> newString = replace(originalString, "​", " ")

If I could have registered "zerowidth.character" I would have. You've got to handle different characters and such.

I should say that the site is mainly a joke because we had to use a particular tool which had a bunch of zero width spaces in it, and it got on our nerves.

> If I could have registered "zerowidth.character" I would have. You've got to handle different characters and such.

Ahhh I didn't get that from the site. I thought it was focused on one specific type of zero width character

> I should say that the site is mainly a joke because we had to use a particular tool which had a bunch of zero width spaces in it, and it got on our nerves.

That's fair enough. :)

Thank you! I appreciate it.

OP, does this mess up ZWJ emoji (like any of the skin-tone-modified emoji)?

Sorry I'm afraid I don't understand the question.

It's how a lot of newer emoji work.

Suppose, for example, that you want the emoji "Man Facepalming Medium Skin Tone". There is no single Unicode code point for that. Instead, it uses five code points:

    U+1F926 FACE PALM
    U+1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
    U+200D ZERO WIDTH JOINER
    U+2642 MALE SIGN
    U+FE0F VARIATION SELECTOR-16
The first two compose to make the "facepalm" with the desired skin tone. The last two make "man". The zero-width joiner composes the first two and last two together, so that the whole sequence renders as a single emoji.
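Concretely, in Python (the full five-code-point sequence for this emoji being U+1F926, U+1F3FD, U+200D, U+2642, U+FE0F):

```python
# Man Facepalming: Medium Skin Tone as a ZWJ sequence. len() counts
# code points, not the single glyph a renderer shows.
seq = "\U0001F926\U0001F3FD\u200D\u2642\uFE0F"
print(len(seq))         # 5
print("\u200D" in seq)  # True: stripping the ZWJ would split the glyph
```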

Here's a list of emoji which use the ZWJ this way:


Meaning, does it mess this up?


It's a pretty obscure edge case. You could retain ZWJs when they sit between two code points in the Symbol or Emoji Symbol classes, but on further reflection, it would probably be more honest to just strip the ZWJs everywhere and decompose the emoji.

EDIT: also what jedanbik said

Hardly obscure. ZWJ emoji sequences are really common now.

Does anyone remember how many years of pain were caused by different line endings -- "\n" vs. "\r" vs. "\r\n"? Unicode is 10x that, and the pain is just beginning.

Did you mean ^10?

Probably, since it's an omnishambles on a scale I won't pretend to understand. Eight-fingered and two-thumbed humans can't type it, but they've managed to make it a joke (see zalgo text), and it only gets more ridiculous with time (see how flags are done, or emoji skin-tone modifiers). It's only a matter of time before Metafont will be easier to read and write than the hieroglyphics we're supposed to call "text."

I made two Python scripts to encode and decode strings into and from zero-width characters: Encoder: https://pastebin.com/DGansW69 Decoder: https://pastebin.com/ZVdvjnZc

I know it's pretty basic stuff, but it does the job. The encoder outputs both to stdout and to a file so you can copy/paste it more easily (from Sublime Text, for example).
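For anyone who doesn't want to click through, here's an independent minimal sketch of the same idea (not the linked scripts): map each bit of the UTF-8 bytes to one of two zero-width characters.

```python
# Encode each byte of a UTF-8 string as eight zero-width characters,
# using U+200B (ZWSP) for 0 and U+200C (ZWNJ) for 1.
def zw_encode(s: str) -> str:
    bits = "".join(f"{b:08b}" for b in s.encode("utf-8"))
    return "".join("\u200b" if bit == "0" else "\u200c" for bit in bits)

def zw_decode(z: str) -> str:
    bits = "".join("0" if ch == "\u200b" else "1" for ch in z)
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

print(zw_decode(zw_encode("hi")))  # hi
```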

What about Firefox?

You can use this addon: https://addons.mozilla.org/de/firefox/addon/foxreplace/?src=...

Apply an autoload rule with these settings:

  Name: zero length replacement
  HTML: Output only
  URL: <entering no url matches all sites>
  Replace: (\uFEFF|\u200B|\u200C)
  Type: Regular expression
  With: <span style="background-color:#F00 !important">&#x1f633;</span>

Thank you!

I, too, disprefer Chrome.

You made that word up!

All words are made up so that's ok.

>All words are made up so that's ok.

Not really, JFYI ;):


Time to read the web exclusively in ASCII with UTF-8 characters inserted using their codepoint


lol you didn't even give anyone a chance. I saw someone asking about this in the comments from the article this morning.

I was really intrigued by this zero-width characters thing haha

It would be a nice enhancement to do so automatically (rather than on-click), but at least for me, this serves the purpose.

Yes I agree, or maybe just update the app-icon if it detects presence of zero-width characters, and then you can click to show them.

not only that, but display a banner warning at the top of the page that tells the user that the page contains zero-width characters.

I wonder if some sort of clipboard tool to detect and warn of sneaky looking things might be even more convenient.

If it's a button to click anyway, couldn't this just be a bookmarklet?

Yep, but it was also a way to learn how to build and publish a Chrome extension.

I wonder if this could be used on Tor to get a user ID.

So does it show the ZWJ within emoji combinations?

