Hacker News
Be careful what you copy: Invisibly inserting usernames into text (medium.com/umpox)
727 points by spongysponges on April 4, 2018 | 191 comments

I did this (non-publicly) many years ago for my eve online alliance. A substantial problem exists in that forging the identity of _someone else_ is fairly easy in a naive scheme if someone detects these characters. That means you can sow chaos by blaming innocent folks. In practice you'll want to "sign" the inserted data as well.

Also, because of the overhead here and the fact that you will want the signature to occur at regular intervals, a better compression scheme than 0=>char1 1=>char2 is needed. Combining zero width chars and homoglyph substitution* can produce codings which hold signed usernames in only a few characters.
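To make the density point concrete, here is a rough sketch of a coding that packs 2 bits into each invisible character instead of 1. The four code points in `ALPHABET` and the helper names are assumptions chosen for illustration, not the scheme the commenter actually used, and homoglyph substitution is left out entirely.

```python
# Four invisible code points, so each one carries 2 bits instead of 1.
# This particular alphabet is an assumption of the sketch.
ALPHABET = ["\u200b", "\u200c", "\u200d", "\u2060"]


def encode(payload: bytes) -> str:
    """Turn each byte into four invisible characters (2 bits apiece)."""
    out = []
    for byte in payload:
        for shift in (6, 4, 2, 0):
            out.append(ALPHABET[(byte >> shift) & 0b11])
    return "".join(out)


def decode(text: str) -> bytes:
    """Collect the invisible characters in `text` and rebuild the bytes."""
    digits = [ALPHABET.index(ch) for ch in text if ch in ALPHABET]
    usable = len(digits) - len(digits) % 4  # ignore a trailing partial byte
    out = bytearray()
    for i in range(0, usable, 4):
        byte = 0
        for digit in digits[i:i + 4]:
            byte = (byte << 2) | digit
        out.append(byte)
    return bytes(out)


def embed(visible: str, payload: bytes) -> str:
    """Hide the payload right after the first word of the visible text."""
    head, sep, tail = visible.partition(" ")
    return head + encode(payload) + sep + tail
```

With this alphabet a 4-byte ID costs 16 hidden characters rather than 32. A larger alphabet would shrink that further, at the price of using more unusual code points.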

There are other, far more interesting ways to watermark text than this that are both harder (to impossible) to detect and produce better results.


P.S. It's nice to see people publish conference papers on this stuff. I always had to hide it because we actually used it.

I saw the topic and the first thing my mind went to was eve online :)

People could still just make a screenshot and the special character would not be visible, so they implemented hidden watermarks: [0]

Then the same alliance would also generate slightly different text for each person. [1]

I think things might have changed nowadays, as everyone is recruiting new players, which makes it easy for other alliances to put spies in. Maybe in some sub-groups that manage large amounts of assets?

[0] http://failheap-challenge.com/showthread.php?12731-Once-upon...

[1] http://failheap-challenge.com/showthread.php?16311-Taking-th...

Pandemic Legion? I haven't looked at the articles yet but I swear I remember them doing that in their forums.

Edit: Good old PL.

Yip, did that as spying was a serious part of eve-online, so much metagaming in that game, fun times.

Later, we just told select people we had implemented some crazy undetectable watermarking, and the leaks stopped - effectively identifying the culprits in the process.

The Death Note method. Reminds me of https://www.gwern.net/Death-Note-Anonymity

Whoa, does anyone else get huge memory usage on that page? I'm on Chrome 64.0.3282.186 and that page was using ~2GB.

More on topic it's always shocking how many things serve to deanonymize these days.

You could say, you snuffed them out!

Kings of lowsec watermarking.

I miss Eve so much, there isn't a mention of Eve on this site that doesn't get me pining for the game again. Sadly everyone I flew with no longer plays :/ What's the status of the game? Last I looked I barely recognised anyone on the sov/influence map.

Still a lot of people playing it. A few fewer subscribers, but not too bad.

In 0.0 it's capitals everywhere. People mine in Rorquals, people do PvE in Carriers and Supers. If you tackle a small ship doing PvE it opens a cyno and Capitals jump in.

They now have fighters that have good application against smaller ships. So the good old rock paper scissors for ship size in Eve is no more.

Still a diverse game, from solo to large fleet fights. I have my fun.

I started in 2004, and played semi seriously for big bouts until probably 2012 or so. Every year or news story would get me back in...and i'd recognise less and less. I stopped myself reading eve politics not that long ago, as that was the bit that most interested me really. After the glory days of roaming in small fleets.

The last time I logged in was a couple of years back. I contemplated moving my shit to where one of the few people I knew was still playing, but the jump fatigue meant that would take half my life. I get the whole point of trying to limit force projection, cos it was very daft with every encounter in 0.0 finishing with PL, NC or Goons dropping in their super fleets. But logistics for the common player is awkward :( It would likely take me many days of gathering and moving shit to get into a position to actually play the game again haha. Not fun.

Just last week we got a patch where they capped the jump fatigue significantly. I don't know how much as I don't use jump drive, but it should be better now.

Also just sell your stuff and rebuy it. You don't need many ships to have some fun.

Roaming is .. not that common any more. What we do is live in a Wormhole, jump into 0.0 and try to tackle something big and hope they form a defense fleet.

Roll the hole to get a new 0.0 exit. Rinse and repeat

> Roaming is .. not that common any more.

Nor is gatecamping because some genius decided that instawarping bubble immunity (with optional covops cloak functionality) was a good idea so people could move through null in virtually perfect safety.

There have been so many changes which I find fundamentally stupid that it's hard for me to engage with the game these days. One day I'll probably play for a bit again though

Most people gate-camp to try to get actual kills, not empty capsules from people just trying to get from point A to point B. And there's plenty of counterplay for insta-warping bubble immunity (i.e. smart bombs). Gate camps are still very common.

Was killing ceptors and pods fundamental for your eve game play?

Edit: Also people still gatecamp.

I think he's talking about the luxury yachts... interceptors can't covops cloak. (And t3's can't instant warp _and_ have bubble immunity)

I struggled to get in because character progression was based solely on account being active, not play time.

And every guide for the game basically starts with "Subscribe, and don't log in for 3 months while you get your newbie pilot some basic skills, or if you do log in, don't do much because you're useless anyway"

Three. Months. How can a game be basically unplayable for newbies for ninety days?

Meh. Time gated games are never fun. The only thing worse than grinding in the game is having to grind hours out of the game.

>> And every guide for the game basically starts with "Subscribe, and don't log in for 3 months while you get your newbie pilot some basic skills, or if you do log in, don't do much because you're useless anyway"

This really isn't true (at least it wasn't some years ago when I last played). What's true though is that the game is complex and does a terrible job of explaining itself. Unless you already know experienced players, the activities immediately available to you as a new player have no resemblance to the "real" game.

It's up to you to explore, get yourself blown up (a lot, ideally), figure out what you're doing, and find a niche. There's plenty to do as a new player that doesn't require months of skill points.

If it winds up speaking to you, the game is absurdly fun and absorbing. I burned out on it years ago, and still remember specific fights and weird adventures I got into.

Edit: My experience as a new player is years out of date, hopefully it's better now. But the real learning curve is always going to be steep.

It's better now. You have more skill you start with, they removed some stupid skills and there are tons of corps that accept free to play chars.

The basic concept is still the same though. You skill with time. Which is nice, because sometimes I don't play for weeks. No "I have to grind X hours" to be on par with my friends.

Any guide that said that was wrong. In-game skills help, but actual ability and knowledge matters a lot more. There were guides like https://www.youtube.com/watch?v=de1hwoFYA_k where people used brand new trial accounts to go get good kills.

But yes, they've made it a lot easier to just hop in without the need to train for anything.

There were some pretty significant improvements to the new player experience. Within a few hours you have enough skills to participate in faction warfare or other small scale pvp as a tackler.

I stopped playing EVE recently because of all the micro-transaction stuff. One of my favorite things about EVE was how the subscription model deterred kids and meant they didn't need to have micro-transactions. Check out BjornBee or ZarvoxToral on Twitch, they run public fleets you can join, and seeing as it's free-to-play that's probably the easiest way to decide for yourself. -Ex [IRON] Pilot.

I think they were bleeding subscribers for a while, and had to go to a form of limited free-to-play to keep the numbers up. I'm a bit sad that the EVE legacy is eventually going to end, since I think the game is still the most unique MMO out there.

But i don't feel the game has the same dynamic as it had in 2008-2010. But i also quit at that time too, so who knows...

indeed. I've played a lot of mmos, from UO and onwards. Nothing really compares to EVE in the slightest. Perhaps SWG to some extent but EVE is like the Dwarf Fortress of MMOs. lots of levels of details and harsh

> There are other, far more interesting ways to watermark text than this that are both harder (to impossible) to detect and produce better results.

This sounds interesting, could you elaborate?

The post above mentions some. One good way is synonym replacement. In this method the actual _text_ is altered every time to produce a unique arrangement of synonyms used throughout the text. For example I can replace "two" with "2" or "fast" with "quick" to obtain a bit to embed things in. See here for examples:


Another is to alter the _frequency_ of certain letters occurring in the text to produce a unique watermark. For example the "number of i's" that occur in the text can be used to produce a unique text per user. This is a very hard attack to detect or do anything about because even _summaries_ of the associated text tend to carry letter and word frequencies forward.

These methods are also impervious to "screenshot" etc. And because the embedded value can simply be a 64-bit key used to look up the user/session info, attacks that attempt to impersonate others are impossible.
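The synonym idea can be sketched in a few lines. The word pairs in `PAIRS` are made up for illustration, and the sketch ignores punctuation, capitalisation and whether a swap still reads naturally, which is where the real difficulty lies.

```python
# Hypothetical synonym pairs: the first word encodes bit 0, the second bit 1.
PAIRS = [("two", "2"), ("fast", "quick"), ("big", "large"), ("begin", "start")]

# Map every known word to (pair index, bit it encodes).
LOOKUP = {
    word: (index, bit)
    for index, pair in enumerate(PAIRS)
    for bit, word in enumerate(pair)
}


def embed(text: str, bits: list) -> str:
    """Rewrite each known word so it spells out the next bit of the key."""
    remaining = iter(bits)
    out = []
    for word in text.split():
        if word in LOOKUP:
            bit = next(remaining, None)
            if bit is not None:  # once the bits run out, leave words alone
                word = PAIRS[LOOKUP[word][0]][bit]
        out.append(word)
    return " ".join(out)


def extract(text: str) -> list:
    """Read the bits back from whichever synonym appears in each slot."""
    return [LOOKUP[word][1] for word in text.split() if word in LOOKUP]
```

For example, `embed("the two fast ships begin a big fight", [1, 0, 1, 1])` gives `"the 2 fast ships start a large fight"`, and `extract` on that result returns `[1, 0, 1, 1]`. A real deployment would need enough replaceable words to carry the whole lookup key, ideally several times over.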

I can see a lot of problems with showing different texts to different users in the same place. For example:

- on a forum, if the post reads slightly different to each user, and one user “quotes” it, the quote will be of what that user saw, and other users will be able to identify the fingerprint from the quote.

- if the fingerprinting script modifies all posts on the forum, then a poster will be alarmed to see his words change.

There are clever ways around both of these issues (rewriting the quotes on render, and hiding the fingerprinting from the OP). But eventually the system gets pretty complicated, and ultimately you’re visibly presenting different text to different users, so it’s no longer an invisible fingerprint.

> on a forum, if the post reads slightly different to each user, and one user “quotes” it, the quote will be of what that user saw, and other users will be able to identify the fingerprint from the quote.

Not if the forum intelligently changes the quoted text :)

how do you handle partial quotes? what about broken quotes?

How do you synthetically alter the frequencies of letters?

I'd assume just replace words with synonyms containing the appropriately weighted letter/s?

The hard part is going to be keeping the flow of the text, whilst brute forcing a thesaurus into it!

The easiest is again to do synonym/word/punctuation replacement but with the intent of altering the letter frequency instead of embedding information directly.

It's effectively a compression scheme built on-top of synonym replacement using "extra" available information to pack more bits in less words. This means even sentence long quotes in summary form are enough to compromise someone.

It sounds to me that it would be blatantly obvious that a human didn't write whatever result that may produce.

And I don't quite get how you can encode that much information in a sentence without it being completely garbled.

It seems to me that it'd be very obvious that a human did not write whatever result that could produce.

Here's 5 bits in that simple sentence. I'm sure others can do better.

It now occurs to me: inject common proofreading mistakes that often go unnoticed.

It does seem tough to vary word usage to produce the right frequency of letters while keeping the meaning and not having it read awkwardly.

interesting, I wonder if that is why there are so many typos and grammatical errors in news articles these days

On forging the identity of someone else, the author did mention it and the workaround is pretty simple (using a secret ID for each user):

> There are some caveats to this method of course. For example, if a user knew of the script they could theoretically insert their own zero-width characters and accuse someone else. A better solution would be to insert a unique user ID that is not publicly available instead of the username.

If someone manages to determine the secret ID then it fails. What about signing a combination of part of the message (likely some hash) and the user ID? This would create a secret ID, but one which, if stolen, can be shown to be false, since trying to recreate the signature with their ID and the message would fail. Hash collisions between different messages are still a concern, but aren't there known solutions to that problem that can be implemented in combination?
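One way to read that proposal is as a keyed MAC over the user ID and a hash of the visible message. The sketch below uses HMAC-SHA256 from the standard library rather than a public-key signature, which is a substitution on my part; `SECRET_KEY`, the 8-byte truncation and the function names are all assumptions.

```python
import hashlib
import hmac

# Hypothetical server-side secret; it must never be shipped to clients.
SECRET_KEY = b"replace-with-a-real-secret"


def make_tag(user_id: str, message: str, length: int = 8) -> bytes:
    """Short tag binding this user ID to this exact visible message."""
    digest = hashlib.sha256(message.encode("utf-8")).digest()
    mac = hmac.new(
        SECRET_KEY,
        user_id.encode("utf-8") + b"\x00" + digest,
        hashlib.sha256,
    )
    return mac.digest()[:length]  # truncated to keep the hidden payload small


def verify(user_id: str, message: str, tag: bytes) -> bool:
    """True only if the tag was produced for this user and this message."""
    expected = make_tag(user_id, message, len(tag))
    return hmac.compare_digest(expected, tag)
```

A tag lifted from one post and planted next to someone else's ID, or moved to a different message, fails `verify`. The hash has to be taken over the visible text with the watermark stripped, otherwise the tag would depend on itself. Truncating the tag trades forgery resistance for size.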

Can't you just use a whitelist of allowable characters?

What about fingerprinting a photographed text? I'm thinking that encoding the hidden message as bits and representing them as spaces around some arbitrary anchor keywords from the original text might work. Extracting the message then requires either OCR or manual work (counting spaces).

It could also be encoded in the choice of serif or sans-serif for each character: http://elonka.com/friedman/

That would have never occurred to me. So many tricks.

I wonder if round tripping texts could be an effective sanitizer. Text to speech and back. English to Chinese and back.

See e.g. Spread Spectrum Image Steganography for embedding messages under the noise level.


Just have the background image have a pattern to it that changes by user: you can encode stuff in a uniform-looking background by making some squares be color (255,255,255) and others be (255,255,254) for example: it would appear to be uniform white but not really.

The classic image steganography technique is embedding bits in the least-significant-bit of each pixel, such that the differences do not produce visual changes but the information is preserved in the raw bytes of the file. The problem with this technique is that the encoded information will not survive a screenshot (since it’s in the LSB, the colors don’t actually change, just the raw bits).

In order to survive a screenshot (or even some of the common processing steps run after uploading an image somewhere), you would need to visibly change the colors, as you say. This could work but could also produce identifiable artifacts.

(Actually, now I’ve typed this out, I’m not sure if a LSB stego image wouldn’t survive a screenshot. I guess it depends on how the screenshot function is implemented — actually would be curious if anyone has tested this.)
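For reference, the LSB technique itself is tiny. This is a minimal sketch over a flat list of 0-255 channel values, not tied to any image library; reading and writing real image files is left out, and the function names are illustrative.

```python
def embed_lsb(pixels: list, bits: list) -> list:
    """Return a copy of `pixels` with `bits` written into the lowest bit."""
    if len(bits) > len(pixels):
        raise ValueError("not enough pixels to hold the payload")
    out = list(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & ~1) | bit  # clear the low bit, then set it
    return out


def extract_lsb(pixels: list, count: int) -> list:
    """Read the lowest bit of the first `count` values."""
    return [value & 1 for value in pixels[:count]]
```

On a white background this produces exactly the 255 versus 254 pattern mentioned earlier: `embed_lsb([255, 255, 255, 255], [1, 0, 1, 0])` gives `[255, 254, 255, 254]`. Whether those values survive depends on every later step being lossless; anything that shifts a channel by even one level scrambles the payload.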

LSB alterations would be preserved in losslessly encoded images, like PNG (if the bit depth was sufficient), but not in lossy image formats like JPEG.

Right, and since most screenshots are jpeg, that means LSB would likely be lost.

At least on macOS, screenshots are PNG. Does Windows make JPEGs?

The Snipping Tool included with Win 10 defaults to saving as PNG

Has been done, see my comment above.


The images are 404, but they used a background with slightly different color and something similar to a QR code.

The background trick is cool, but what happens if the user removes it entirely? Lexical Steganography might be a more suitable approach: http://web.mit.edu/keithw/tlex/

Just skimmed it, but is that not close to method #2 the same Alliance used: http://failheap-challenge.com/showthread.php?16311-Taking-th...

Could you talk more about why this was useful for Eve? What's an example of how you could blame innocent folks? And how could you sign the inserted data – do you mean cryptographically?

Let's say you are a Cuban spy inside the NSA. While you copy memos you notice a watermark with your personal NSA employee ID embedded in the paper.

The NSA might have spies inside the Cuban intelligence services. If you remove the watermark, the NSA will know that you know about their counter intelligence.

If you leave the watermark in they know you are the spy. But, if you change the watermark to some ID from another employee there will be no downside for you. And you start tension within the NSA.

Maybe you can even turn the burned innocent employee, because he will be pissed at the NSA.

Eve is like that.


Reminds me of a story. One time there was a large battle in Eve. If you want to repair / heal a friendly ship in Eve you have to target it and press F1 to start your remote repair module (healing spell).

In the heat of the battle the pilot Ivory, a healer from Team A, messed up and accidentally targeted the enemy ship his fleet was shooting at the time. He did not notice his mistake and shouted "All reps on Cain!" (Use healing spells on the pilot Cain.) Cain was from Team B.

Counter intelligence officers from Team A thought that our healer Ivory was a spy and had just pressed the wrong push-to-talk button in TeamSpeak. They sifted through logs and found out his IP; he was from Alberta, CA, just like the leader of Team B.

Team A set a trap, killed an expensive ship of Ivory's, and kicked him from the Alliance (Guild).

Ivory was not a spy :) Internet spaceships are serious business.

How did you find out he wasn't a spy?

I think a week later Team A apologized, replaced his ship and took him back.

But the accusation was stupid to begin with. He was a loyal member for years, living in the same Canadian province as another Eve player is not really uncommon, and shouting on TeamSpeak who is currently getting shot would not have been helpful.

Diplomats from Team B said he was not a spy.

Spying is a big problem in Eve. At the most basic level a spy is able to take screenshots and copy/paste text to send back to the entity they're spying for. By watermarking both text and forum backgrounds the data effectively becomes tainted in the sense that the screenshot/text will have unique characteristics that allow the original poster to identify who copied the data in the first place.

The whole flow would be:

- X is a player in alliance FOO but is actually an agent planted there by alliance BAR.

- Y is a player in alliance BAR but is actually an agent planted there by alliance FOO (or any other).

- X copies intel from FOO's forums and sends it back to the people in charge of the spy program in BAR.

- That intel gets shared with key personnel in alliance BAR, so they can take action based on the gathered intelligence; unknown to BAR, among that personnel is Y

- Y sees that intel and sends it back to their leadership in alliance FOO

- The people at alliance FOO identify the unique data on that intel and track it down to X and proceed to kick him for being a spy.

As someone who is involved with alliance leadership stuffs: This may sound convoluted but it's really bread-and-butter level stuff in Eve, it can get significantly weirder. For example this article discusses whitespace character-based fingerprinting: that's amateur level, like described elsewhere in this thread.

To answer your question: counter-intelligence. If your method for tagging data is known and easy to replicate, such as watermarking your forum userID in screenshots, the people at alliance BAR can edit the screenshot (or forge a new one) where they insert the userID of an innocent person in alliance FOO in the gathered intel, this way when Y grabs the data for sending back to FOO, they'll be unknowingly sending evidence that incriminates a loyal member.

edit: I noticed I didn't address your last question. Signing data is just regular cryptographic signing yeah, in the above example to prevent tampering you'd insert the hash from the userID plus a secret salt for example. You just need some way to prevent the hostiles from incriminating someone else.

> The people at alliance FOO identify the unique data on that intel and track it down to X and proceed to kick him for being a spy.

That, or deliberately feed misinformation to the spy.

Screenshots made in World of Warcraft contain a watermark which contains the account number of the player, as well as other info [1] [2]. This is used to find/combat cheaters (e.g. botters).

As for spying, I'm pretty sure this is a problem in high end raiding in WoW. But due to the nature of WoW (not being a king of the hill MMO to rule land) not nearly as much as EVE. Blizzard uses it to combat cheating.

[1] http://www.tomshardware.com/news/watermark-screenshots-World...

[2] https://eric-diehl.com/world-of-warcraft-and-watermarking/

Couldn't someone just black out the watermark?

In this case not. The watermark is all over the place, viewable when you zoom in. The alternative is to use a 3rd party tool to make screenshots such as Greenshot, ShareX, or Gyazo.

Could "fingerprinting" a screenshot be foiled by running the screenshot through a lossy algorithm? Or doing something like converting to jpg->gif->jpg->gif->png->jpg... a few rounds? Enough to keep the picture viewable, but just barely?

Or hell, even simpler; taking a cameraphone picture of the screenshot?

Depends on the type of fingerprinting used. For the example I mentioned with the invisible pattern, yes, but if the fingerprinting is done via other means, such as replacing some words in a post with synonyms so that each user gets a unique combination of words, it wouldn't work.

Generally speaking in these scenarios you don't want to grab anything directly from the source, just relay what you saw and write it down in your own words. Even then it has to be handled carefully (how many other people got this information? Are the details I'm seeing slightly incorrect in order to filter out who leaked this?)

Also take into account that you can be caught just by repeated "A/B testing" of sorts: half the population gets informed that at 19:05 some operation is taking place, the other half that the op is taking place at 19:10. The next day they do the same but using different groups, if you have access to the data that is being leaked you can track down who is leaking it like that after a few iterations.
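The A/B approach amounts to giving each member a binary code, one bit per round. A rough sketch, assuming a single leaker who passes on every round's version and members numbered from 0 (all names here are illustrative):

```python
def rounds_needed(member_count: int) -> int:
    """Rounds required to tell `member_count` people apart, one bit each."""
    return max(1, (member_count - 1).bit_length())


def variant_for(member_index: int, round_number: int) -> int:
    """Which of the two versions (0 or 1) this member sees in this round."""
    return (member_index >> round_number) & 1


def identify(leaked_variants: list) -> int:
    """Rebuild the member index from the version that leaked in each round."""
    return sum(bit << r for r, bit in enumerate(leaked_variants))
```

With 10 members, `rounds_needed(10)` is 4. If member 6 leaks every round, the leaked versions are `[0, 1, 1, 0]` and `identify` returns 6. Several leakers, or a leaker who skips rounds, would need more rounds and a less naive assignment.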

The in-game stuff is the tip of the PVP iceberg as far as Eve goes. Whole alliances have been killed off using a handful of reddit sockpuppets impersonating their members, brought down by a player flipped at a social event, or simply imploded under the weight of internal drama stoked by rivals. Often all 3 at the same time.

It's a very interesting game, built on a really crappy space themed spreadsheet.

The advantage of putting the (signed) full username in there is that you can demonstrate to a third party (like a court?) who is to blame. Just encoding a number for each user, or giving each user a different number of Es is deniable to a certain extent.

But such a simple system would make it easy for one user to frame another.

You would need a way of signing the username. Perhaps you could hash the user ID with a secret salt, and hide that in the text along with the username.

I don't have access to the paper, but wouldn't unicode normalization (and zero width deletion) effectively remove all watermarking?
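As a rough sketch of what that sanitizing step could look like with Python's standard `unicodedata` module: apply NFKC normalization, then drop every character in the Unicode "Cf" (format) category, which includes the zero-width space, joiner and non-joiner. Treating all of Cf as removable is an assumption of this sketch and is too aggressive for scripts and emoji sequences that rely on joiners.

```python
import unicodedata


def sanitize(text: str) -> str:
    """Normalize compatibility forms, then drop invisible format characters."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(
        ch for ch in normalized
        if unicodedata.category(ch) != "Cf"  # zero-width chars, BOM, etc.
    )
```

Note that normalization alone does not delete zero-width characters, so the explicit filter is what does the work there. Look-alike letters from other scripts (a Cyrillic "а" in place of a Latin "a") also pass through unchanged, and nothing here touches synonym- or frequency-based watermarks.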

which alliance were you in? :)

How difficult would it be to write a browser extension to either remove all zero-width characters or somehow make it super obvious that they are being used on the page?

I just searched for "zero-width" and "zero width" in Chrome and Firefox's extensions stores, but didn't come up with anything.

I just made a very basic one: https://chrome.google.com/webstore/detail/icibkhaehdofmcbfjf...

Code here: https://github.com/roymckenzie/detect-zero-width-characters-...

Submit a PR! I know it could be better!

Needed features:

* right-click selected text and "Sanitize and Copy"
* toggle off and on ...

Probably not too hard. I've made a jsfiddle to identify and remove such stuff. Feel free to copy any of it to an extension. https://jsfiddle.net/tim333/np874wae/13/

I took your code and added to it. Removed jQuery and it now searches through ALL text nodes on a page and replaces the text in it with the hidden characters plus the original visible text. There's still lots of room for improvement but it's a start. For now I'm running it in the console to make it work but someone can build upon this and turn it into an extension (maybe I will if I have time)

I put it in a gist:


Would probably be better to do this at the OS level, no? Just ensure that shift-cmd-V/shift-ctrl-V strips zero-width characters in addition to formatting. I can't think of a situation where I'd want to keep one but not the other, and you could always do that manually if it came up.

Arabic? I think there are cases where letters are joined or not by default, but sometimes need to be forced to the other state, depending on meaning that the text layout engine isn't expected to be able to figure out.

If it'd also optionally (default=true) strip text formatting from copy/paste that'd be epic. No more copy-from-browser-then-paste-into-notepad-then-copy-from-notepad-then-paste-into-email-or-chat

What matthberg said. I'm on a Mac and I use shift-cmd-V frequently (more often than cmd-V).

Unfortunately, there are a few apps that use cmd-opt-shift-V instead of cmd-shift-V. You can fix most of them using this:


After that, almost everything will use cmd-shift-V. However, Microsoft Word is still broken, apparently because it uses a slightly different command that does a similar thing, "Paste and Match Formatting". Haven't found a way to fix that yet.


In a good few applications that is exactly what Ctrl+Shift+V does. Try it in Chrome or Firefox, it pastes without formatting.

Zero width characters have legitimate uses such as family emojis and certain Indic scripts. You don't want to break these.

I'd rather be protected from tracking than worry about some emojis breaking or text in script I can't read formatting incorrectly. Of course, this should be an individual decision since some people do read those scripts and some people do care about emojis.

I don't think that it's too difficult. Just check every tag's text for forbidden characters and replace them with something. I'm not sure about performance.
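The check itself is small. Here is a sketch in Python rather than extension JavaScript, just to show the "replace them with something" step; using the Unicode "Cf" category as the list of forbidden characters and the `<U+XXXX>` marker format are both assumptions.

```python
import unicodedata


def reveal(text: str) -> str:
    """Replace each invisible format character with a visible marker."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


def count_hidden(text: str) -> int:
    """How many invisible format characters the text contains."""
    return sum(1 for ch in text if unicodedata.category(ch) == "Cf")
```

`reveal("user\u200bname")` returns `"user<U+200B>name"`. In a browser the same per-character test would run over each text node.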

Besides performance concerns, if you did it on each page load it would miss any content added to the page after the initial load. It would be useless for SPAs.

The best way I can think to implement it would be listening for a copy event and replacing the text in the clipboard.

Something like:

  document.addEventListener('copy', sanitizeClipboardContent);

No need to replace text. Patch all the fonts in your OS to display something instead of zero width. That will work outside the browser also.

There really is no valid use case in Latin script so why is zwsp allowed next to Latin characters?! (Emojis is not a valid use case, and why do they depend on zwsp anyway?)

Whenever I paste, I use ctrl-shift-v. It solves this problem and a multitude of other pasting problems.

That just removes formatting, reducing the copied text to plain-text. If I paste the strings from the blog post into atom and then arrow right from the beginning, there are locations where the col number will increase and the cursor not move.

I don't think ctrl-shift-v will do what you think here because we're not talking about formatting at all. We're talking about regular unicode UTF-8 encoded text.

that doesn't work on my computer (Fedora), but what I usually do is paste it in a box that doesn't allow formatting and then copy it again... but I don't know if it would remove fingerprinting in this case

This sort of thing is one of the reasons I never liked the "noise texture" that appeared on MacOS X and other GUIs and websites not so long ago. I always thought my (former) OS was fingerprinting every screenshot I made. I'd love to be proven wrong, but you can never be too careful.

This reminds me of a few years back when the internet identified a parody twitter account by analyzing iOS screenshots it posted.

I just tried googling for the story but I can’t remember what the account was about. I think it was some sort of parody silicon valley account. It was a great story, if anyone remembers and can find the link.

I remember that Blizzard had secretly watermarked WoW screenshots.


Also be careful of copy-pasting bash commands or install instructions to your terminal, they can contain hidden zero-width malicious commands, as well as a newline at the end to make the command run immediately. Ohmyzsh on my machine detects copy-pasted text and warns you.

I thought this was also done by adding spans in the text set to display: none. You will still copy that text without selecting it:

  cd /tmp;<span>rm -R ~/;</span>ls;

Yes, I actually confused the two concepts and my comment is a bit misleading. This exploit is only done using display tricks, NOT using zero-width characters. Zero-width characters are very limited and can't actually spell out commands (to my knowledge).

It is though, still another reason to be careful when copy pasting.

Now I'm thinking if you can somehow put an ESC character in text so that when you copy-paste it into vim, it goes to normal mode and starts performing commands. Hmm...

Here are PoC exploits against various editors:


Even pasting to cat(1) might be insecure. The paste can contain ^D, which will make cat quit; then the rest of the paste will be interpreted by the shell.

Well, I think you can just insert the escape 'character' into the text.

At least Firefox 52 doesn't copy the "display: none" stuff. You have to use other means to hide the text.

Don't forget the fun you can have making JavaScript look innocuous:


Yeah, OhMyZsh also warned me on my Mac. I definitely think it is a must for Unix environments.

The first time I met zero-width characters, (I suppose this was long before they became popular for "fingerprinting" text) it was in a weird bug where some javascript would fail due to a \u200b being present in a user-entered string (it was easily fixed by changing the method that we used to sanitise strings). I remember thinking "wow with these zero-width characters you could do steganography within text, even in a very short string". It looks like I wasn't the only one who had that idea.

A good utility to use to combat all clipboard-related exploits is one of Apple's sample code projects: ClipboardViewer[0]. You can see a screenshot of it displaying the copied code from the demo in the article here[1].

Besides being able to see the hidden characters, you can also see the internal "layers" of clipboard, e.g., how can a rich-text sentence be pasted to both a plain-text editor and a rich-text one.

[0]: https://developer.apple.com/library/content/samplecode/Clipb...

[1]: https://s3.andyfang.me/screenshots/clipboard-viewer.png

You can also see the unprintable characters at the command prompt by piping the data through `cat -v`.

This would be an interesting approach to plagiarism detection; I could see how it would be used for a couple of online courses that I use with my students. Of course it's just part of the arms race, though.

My thoughts exactly.

IBM did something similar with unused high order bits in a firmware image that Memorex was accused of copying. This was the first time I've seen zero width characters used, presumably you could build a brainfuck compiler that would let you write code as zero width spaces :-) Then you could have an invisible script inside your document. Fun but not particularly useful.

EDIT: Or a whitespace interpreter (https://en.wikipedia.org/wiki/Whitespace_(programming_langua...)

My favorite use of invisible characters was to enable spoilers in Facebook without littering the post with garbage.

First line explains it's a spoiler and for what, hundreds of invisible characters, actual spoiler.

That way FB would just show the first line followed by "read more"

That's a really interesting technique!

I'm trying to think of what else could be done with the encryption / decryption, but tracking is a really effective use case.

Could probably encode some other secret messages in there, make a blog post about cheese include a hash to a pastebin.

It also reminds me of the importance of having strong validation around things like usernames, because if I had a username that looked official but contained an invisible character... Related: ICANN explicitly forbids domain names from including zero-width space.
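A minimal sketch of that kind of username check, assuming Python and a deliberately strict policy that rejects any format (Cf) or control (Cc) character. The function names are made up for illustration, and a real site would probably also want confusable-character checks on top of this:

```python
import unicodedata


def has_invisible_chars(name: str) -> bool:
    """Return True if the name contains format (Cf) or control (Cc) characters."""
    return any(unicodedata.category(ch) in ("Cf", "Cc") for ch in name)


def is_acceptable_username(name: str) -> bool:
    """Hypothetical policy: non-empty and free of invisible code points."""
    return bool(name) and not has_invisible_chars(name)
```

With this policy "admin" passes while "adm\u200bin" (with a zero-width space in the middle) is rejected, even though the two render identically.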

Not much new can be done encryption-wise, as this falls more under the category of steganography, which is security through obscurity.

If anyone isn't too keen on reading the article:

Source Code: https://github.com/umpox/zero-width-detection

Demo: https://umpox.github.io/zero-width-detection

Although the demo works, if you copy just part of the text, your username isn't taken with it. I think a better way would be to insert the zero-width characters in between the letters and repeat them throughout the text.
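Here is a rough Python sketch of the "interleave and repeat" idea, not the article's actual code. It assumes U+200B and U+200C stand for bits 0 and 1, and uses U+2060 (word joiner) as a made-up separator so a decoder can find where each repetition starts. All names are hypothetical:

```python
ZERO = "\u200b"  # zero-width space, encodes bit 0 (assumed mapping)
ONE = "\u200c"   # zero-width non-joiner, encodes bit 1 (assumed mapping)
SEP = "\u2060"   # word joiner, marks the start of each repetition (assumption)


def to_marks(username: str) -> str:
    """Turn a username into a separator followed by one invisible mark per bit."""
    bits = "".join(format(b, "08b") for b in username.encode("utf-8"))
    return SEP + "".join(ONE if bit == "1" else ZERO for bit in bits)


def fingerprint(text: str, username: str) -> str:
    """Insert one mark after every visible character, cycling through the payload."""
    marks = to_marks(username)
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        out.append(marks[i % len(marks)])
    return "".join(out)


def recover(fragment: str):
    """Return the username from the first complete repetition in a fragment, or None."""
    marks = "".join(ch for ch in fragment if ch in (ZERO, ONE, SEP))
    # Only chunks with a separator on both sides are guaranteed complete.
    for chunk in marks.split(SEP)[1:-1]:
        if chunk and len(chunk) % 8 == 0:
            bits = "".join("1" if m == ONE else "0" for m in chunk)
            data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
            return data.decode("utf-8", errors="replace")
    return None
```

A fragment only needs to span one full repetition plus the separators around it, so a short username survives a partial copy of a few dozen characters. The cost is that the text gets twice as long in code points and becomes much easier to spot in a hex dump.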

Was just thinking this. This would be especially useful with plagiarism detection.

In addition to the use of Diff Checker mentioned by the author, spell-checkers will also highlight words that are broken up by zero-width characters.

Sublime Text package "Sublime Gremlins" [1] detected and highlighted the zero-width characters: https://i.imgur.com/LNlcgRK.png


[1] https://github.com/redoPop/SublimeGremlins

The interesting question is how to defend against this.

If you are a journalist wishing to protect your source, what tool could be used to process content such that the essence is left intact but the Unicode zero-width steganography is stripped, replaced instead by the common space character?

I know enough to say that you cannot just search and replace, as many of the zero-width characters have a meaning in different languages and produce a visual effect when combined with other runes. Just stripping them all will break text in those languages.

Is there a method for removing zero-width whitespace such that journalist sources could be protected?
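For what it's worth, one possible heuristic looks like this in Python. It is only a sketch under my own assumptions: zero-width space, word joiner and BOM are always dropped, while ZWNJ/ZWJ are kept only when they sit between two non-ASCII letters or combining marks (where they may be doing real work, as in Persian). The function names are invented:

```python
import unicodedata

ALWAYS_STRIP = {"\u200b", "\u2060", "\ufeff"}  # ZWSP, word joiner, BOM/ZWNBSP
JOINERS = {"\u200c", "\u200d"}                 # ZWNJ, ZWJ


def is_script_char(ch: str) -> bool:
    """True for a non-ASCII letter or combining mark (heuristic, not a standard)."""
    return ch != "" and not ch.isascii() and unicodedata.category(ch)[0] in "LM"


def scrub(text: str) -> str:
    """Drop zero-width characters, keeping joiners only inside non-ASCII words."""
    out = []
    for i, ch in enumerate(text):
        if ch in ALWAYS_STRIP:
            continue
        if ch in JOINERS:
            prev = text[i - 1] if i > 0 else ""
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if not (is_script_char(prev) and is_script_char(nxt)):
                continue
        out.append(ch)
    return "".join(out)
```

This does nothing about homoglyphs, synonym choices or whitespace patterns, it will split emoji that are joined with ZWJ, and a watermark hidden inside non-Latin text would survive it, so treat it as a first pass rather than a guarantee.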

The safest thing is to retype it. But that doesn’t cover the risk of synonym/frequency fingerprinting discussed elsewhere in this thread.

What if you also did a random synonym replacement throughout the piece to destroy the watermarking? If the source is anonymous and hidden, then authenticity cannot be checked by the reader anyway, and so replacement without changing meaning is an acceptable change to protect sources.

Welcome to the new age of computer-powered math where anonymity is not a guarantee.

Plenty of discussions in this thread show why. Anonymous screenshot or picture? Image watermarks -> anything from subtle color changes, to sans/serif font changes, to word replacement. Anonymous text or summary? Word replacement, letter biasing encoding, etc. Anonymous audio? Inaudible frequencies, audible waveform deformation encoding, etc.

Nothing's out of the realm of possibility for the paranoid.

You would have to change every word, since any could be a watermarked synonym. A better way would be to read it, make a summary, then rewrite it from memory and only use the source data to correct factual differences.

You wouldn't necessarily have to change every word; just enough to break the decoding scheme. But even then it's totally random, so the longer the document, the more opportunities to be fingerprinted. It's like the old saying goes, "the police only need to be lucky once, but the criminals need to be lucky all the time." [0]

[0] I never bothered looking up where this came from until just now... interestingly it's from the IRA, and used in the total opposite way most people use it now... https://en.wikipedia.org/wiki/Brighton_hotel_bombing

Your saying illustrates what I'm saying perfectly: if you miss even a single fingerprinted word, it might uniquely identify you. So you need to change every single word, and even that isn't enough in case eg. adjectives were added or omitted.

So paraphrase it.

In my article that Tom[0] links to I mentioned a number of methods, but you may be interested in a followup piece I wrote as well:


Essentially the best method against these fingerprinting techniques isn't technical—it's just not re-sharing the contents in the first place. Sorry, small differences are just too easy to embed and there are so many bits to work with.

[0] Thanks Tom, you the man!

Print it and use OCR?

Years ago I worked on some software that would adjust the kerning on text slightly to embed the name of the user who printed the document.

Kerning probably doesn't survive OCR though.

You can still embed a person's name by replacing words in the text by synonyms.
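To make the synonym trick concrete, here is a toy Python sketch. The word pairs are invented for illustration, and a real scheme would need a much larger dictionary plus some care about grammar and context. Each pair carries one bit: the first word means 0, the second means 1.

```python
# Hypothetical synonym pairs; index 0 encodes bit 0, index 1 encodes bit 1.
SYNONYMS = [("big", "large"), ("quick", "fast"), ("begin", "start"), ("buy", "purchase")]

PAIR_OF = {word: pair for pair in SYNONYMS for word in pair}
BIT_OF = {word: str(idx) for pair in SYNONYMS for idx, word in enumerate(pair)}


def embed(words, bits):
    """Rewrite each word that has a synonym so that it spells out the next bit."""
    out, i = [], 0
    for word in words:
        if word in PAIR_OF and i < len(bits):
            out.append(PAIR_OF[word][int(bits[i])])
            i += 1
        else:
            out.append(word)
    return out


def extract(words):
    """Read one bit back from every word that belongs to a synonym pair."""
    return "".join(BIT_OF[word] for word in words if word in BIT_OF)
```

Nothing invisible is involved, so this survives printing, OCR and retyping. The only defence is rewording, which is what the comments above are getting at.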

Where can I find a full list of zero-width characters?

It looks like there are just four: https://en.wikipedia.org/wiki/Zero_width

Note that the zero-width characters are not the only characters that one is unlikely to notice. I regularly use the RLE, LRE, and other non-printing characters [1] in my text.

[1] http://dotancohen.com/howto/rtl_right_to_left.html

In addition to RTL/LTR mark characters, there are also TAG characters[1].

[1] https://en.wikipedia.org/wiki/Tags_(Unicode_block)
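Rather than keeping a hand-written list, one option is to ask the Unicode database. This Python sketch reports every character in the "Cf" (format) general category, which covers the zero-width characters, the bidi controls and the tag characters mentioned above. The function name is my own:

```python
import unicodedata


def invisible_chars(text):
    """Yield (index, code point, name) for every format-category (Cf) character."""
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            yield i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")
```

Note that Cf also includes the soft hyphen (U+00AD), so a hit is not automatically a fingerprint.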

Whitelisting might be better...

Not if you want to use them.

What was the original purpose for these characters' design?

I use the zero width no break space with formatting languages to format only parts of words. Something like _rest_ful to render the first part in small caps may not work, while _rest_ ful will, of course, but it renders as two words. Replacing the space with a zero width no break space solves that.

Most certainly HTML layout hacks, like 'I want to create a table with line height but zero width' ;-)

(just kidding)

Interestingly, pasting text with zero width chars into an iMessage chat box does alert the user by showing "question in a box" chars for the zero-width chars.

Most leaks are screenshots :D

But seriously, ZWNJ is really hard to see when you copy text. The only way to sanitize it is to run it through a program.

I'm surprised I can't detect them on notepad++

I switched the Encoding to ANSI and it showed me the pasted text like this:

F?or exam?ple, I’ve ins?erted 10 ze?ro-width spa?ces in?to thi?s sentence, c?an you tel??l?

I can confirm that Vim shows the hidden characters as question marks.

So does Visual Studio Code

Vim, terminal, or font?

This reminds me of the times I've had to help beginning students debug very confusing errors because one of these characters somehow found its way into someone's source code (likely because a word processor was used to edit it at some point).

The so-called-better-way of doing this using Unicode substitution can be found at http://smartdata.cs.unibo.it/watermark/

I like the concept. Please correct me if I'm wrong: it looks like you'd need a lot more than "46 to 101 characters" in a message before you can apply this method reasonably.

An MD5 hash is long (128 bits, whatever the password's length), and this method only allows you to replace the subset of "confusable" Latin chars in a text.

For instance: C (0x0043) can be swapped for its look-alike Ⅽ (0x216d).

When you reach one of these replaceable characters, you either replace it or you don't, which you mark as either 1 or 0.

So for the string "password", our binary MD5 hash is "01011111010011011100110000111011010110101010011101100101110101100001110110000011001001111101111010111000100000101100111110011001".

That's 128 possible replacements needed in the original text.

I imagine the original text would have to be at least 10-20 times that length before we found enough "confusable" Latin chars to replace.

I'm eager to hear what I'm missing, because I do like this method a lot.
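The capacity question is easy to check with a few lines of Python. This is only a sketch of the replace-or-don't scheme as described in this comment, not the linked project's implementation. The confusable table is a tiny hand-picked sample (capital letters mapped to Roman numeral code points), and the function names are invented:

```python
import hashlib

# Tiny illustrative table; a real confusables list is far larger.
CONFUSABLE = {"C": "\u216d", "I": "\u2160", "V": "\u2164", "X": "\u2169", "M": "\u216f"}
LOOKALIKES = set(CONFUSABLE.values())


def md5_bits(secret: str) -> str:
    """Return the MD5 digest of the secret as a 128-character bit string."""
    digest = hashlib.md5(secret.encode("utf-8")).digest()
    return "".join(format(b, "08b") for b in digest)


def capacity(text: str) -> int:
    """Number of bits the text can carry: one per replaceable character."""
    return sum(1 for ch in text if ch in CONFUSABLE)


def embed(text: str, bits: str) -> str:
    """Swap a replaceable character for its look-alike to write 1, leave it for 0."""
    out, i = [], 0
    for ch in text:
        if ch in CONFUSABLE and i < len(bits):
            out.append(CONFUSABLE[ch] if bits[i] == "1" else ch)
            i += 1
        else:
            out.append(ch)
    return "".join(out)


def extract(text: str) -> str:
    """Read one bit per replaceable position: look-alike means 1, original means 0."""
    return "".join("1" if ch in LOOKALIKES else "0"
                   for ch in text if ch in CONFUSABLE or ch in LOOKALIKES)
```

Comparing `capacity(cover_text)` with `len(md5_bits("password"))`, which is 128, shows directly whether a given text is long enough. With a table this small, most English prose falls well short, which supports the estimate above.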

Unicode should not have invisible characters.

Then how would you indicate that a space or newline is intended? Or how would I indicate that a length of text is to be displayed from right to left when I post Hebrew text to an English-expecting text field?

> Then how would you indicate that a space or newline is intended?

They aren't invisible, you can see the result as spaces and the next line.

> how would I indicate that a length of text is to be displayed from right to left when I post Hebrew text to an English-expecting text field?

That is also a visible effect.

Nope, not always. Try putting a newline in HTML, it is ignored.

> ignored

Not exactly, it still serves as a word separator. Like the newline (not a space) I put between "word" and "separator".

To be fair, that’s because HTML is a descendant of SGML


Then how would you enter a ZWNJ - for example to correctly indicate that the ligatures should not cross the morpheme boundaries?

I would make a morpheme ligature such as æ a separate code unit. Oh wait, it already is!

The postscript mentions a unique user ID. Ideally you'd just want a hash of the user ID, some private key, and possibly a session ID. The quick and dirty method works great for a one off though.
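Something along these lines, as a Python sketch. It swaps the plain hash for an HMAC, which is the usual way to mix a private key into a digest. The function name, the separator byte and the 8-hex-character truncation are all my own assumptions:

```python
import hashlib
import hmac


def watermark_tag(secret_key: bytes, user_id: str, session_id: str, length: int = 8) -> str:
    """Derive a short, keyed tag identifying a user and session.

    Without secret_key nobody can compute a valid tag for another user,
    which addresses the forgery problem of embedding raw usernames.
    """
    message = f"{user_id}\x00{session_id}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()[:length]
```

The server would embed the tag instead of the username and, when a leak turns up, recompute tags for its known users and sessions to find the match. Truncating the tag keeps the hidden payload small at the cost of a higher collision chance.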

this haphazard attempt at DRM seem like a perfect OpSec layer in case the purpose is to throw off the gullible media or any "experts" overly keen on cyber attribution. The method seems flawed unless you actually want your adversary to find it and make incorrect assumptions, it's probably not the right tool for the job.

EDIT/PS: the first one who finds the homoglyphs within this very status update and posts it in the comments wins an iPhone which has still not been patched against the జ్ఞ‌ా vulnerability ;)

In the post this apostrophe: [‘] is used instead of the standard ['] apostrophe or [`] grave. It's pretty trivial to check just by opening it in any non-Unicode supporting terminal. In fact the zero width characters show up explicitly as blue <200b> in vim.

People always throw around language compatibility when the unusual features of Unicode are thrown around, but stuff like zero width characters really don't need to be supported for language compatibility.

On another note, if Unicode is willing to butcher Chinese/Kanji orthography with Han unification then it ought to be willing to get rid of Latin homographs.

Zero width characters don't seem to have a use case for most web browser users. It seems that they should be filtered out, with an option to display them left in about:config

edit: Others have pointed out that there ARE legit uses of zero-length chars, especially for other languages. Still, I bet a solution does not have to be all or none. I bet common legitimate use-cases can be segregated from others.

Just when I thought I was aware of all major data leak pathways, something like this comes up and it leaves me dumbfounded.

Chrome on Android put a line break in the middle of a word. I assume that was an "invisible" character?

Nice way to fingerprint text and keep it from just being copy/pasted around the web

Related HN thread regarding a chrome extension to counter this:


How do I see those invisible characters in emacs or vim? In emacs I thought that whitespace-mode would do the trick but apparently it doesn't.

vim (7.4) seems to display them by default. With my whole vimrc commented out (just to be sure it wasn't a setting I changed), I get this:

    F<200b>or exam<200b>ple, I’ve ins<200b>erted 10 ze<200b>ro-width spa<200b>ces in<200b>to thi<200b>s sentence, c<200b>an you tel<200b><200b>l?
(The <200b> is also highlighted a different color than the rest of the text, and acts like a single character when moving the cursor through it. It's really, really obvious.)

Indeed, I copy-pasted in emacs but not in vim, my bad...

In emacs those characters are by default visible as one-pixel wide spaces, to make them more apparent eval (update-glyphless-char-display 'glyphless-char-display-control '((format-control . empty-box) (no-font . hex-code))).

To view it, I used:

    echo "Ctrl+V" | hd
Which makes a hexdump. Easily visible there.

If you're looking for Vim or Emacs specifically, I don't know.

Otherwise known as steganography. It can also be done with images, text, sound, DNA and virtually anything that carries information.

    const zeroPad = num => '00000000'.slice(String(num).length) + num;
What a bad way to declare a function. Starting with the word 'function' will make the code much more readable. Arrow functions are supposed to be used as small callbacks, not to obfuscate the meaning of the code.

If you spend a few days reading code written in this style, you will quickly get used to it. This is not “obfuscated”, and there’s nothing wrong with using this syntax for general functions. (Personally, I would recommend putting the function body on the next line, but meh.) If you have any kind of reasonable text editor, you can add syntax highlighting or other visual call-outs for functions, if that is important to you.

I was more confused by the parameter being named 'num' before having the String function called on it and its length taken - when in fact 'num' should be a binary string when it's passed into the function. But maybe I've been spoiled by strong types.

The rightmost num is coerced. And the hole is finished on par.

function keyword carries a lot of stuff: this variable, arguments variable, function name can be accidentally changed later. constant lambda is easier to understand.

I'm from Haskell. I love it.

Would someone please share a quick image that shows a visual example?

so, we'll copy text to editor, screenshot it, OCR it and then paste? did I get it right?

   Zero-width characters are invisible, `non-printing' characters that are
   not displayed by the majority of applications. F*or exam*ple, I've
   ins*erted 10 ze*ro-width spa*ces in*to thi*s sentence, c*an you tel**l?
   (Hint: paste the sentence into Diff Checker to see the locations of the
   characters!). These characters can be used to `fingerprint' text for
   certain users.

Above is what the paragraph looks like in a text-only browser in VGA textmode.

That's just because the browser doesn't handle unicode well. The 'text-only', 'VGA' and 'textmode' are actually irrelevant. The behaviour you are seeing is down to programmer choice/laziness/missing support.

If you're only expecting ASCII text, then not being able to show anything else could even be considered a feature to reduce attack area, since any sort of Unicode trickery then becomes impossible.

There are more languages in the world than English, not supporting Unicode is not an option for most.

Yes, but thread OP was essentially saying "look how those chars appear when you can only render ASCII." And then the person you replied to said that explicitly.

I'd wager that both of them know there are more languages than English.

So what happens when you try to display Chinese or Hindi text? You're just going to get a screen of stars, like you would with unicode square boxes.

"If you're only expecting ASCII text" categorically excludes trying to display Chinese or Hindi text from the program's specification.

   s/text-only/ASCII-only, no Javascript/
   s/browser/& I use/
   s/in VGA textmode//
@userbinator is correct; I often find this is advantageous for me and neither need nor want Unicode "support" in this program; I have other programs I use to view non-ASCII characters if the need arises.

I am in agreement with this programmer's choices, for the most part. Certainly I agree with this one.

Links 2.8 <http://links.twibright.com/> also shows the non-printing characters as big black asterisks. But Emacs doesn't show them in text-mode, have to switch to hexl-mode.

Yeah but so what? Almost exactly 0 people use text-mode browsers.
