Zero-Width Characters: Invisibly fingerprinting text (zachaysan.com)
572 points by based2 10 months ago | 146 comments



This is a pretty common method of watermarking sensitive content in EVE Online alliances; however, we've found it suffers from some serious drawbacks.

Zero-Width characters tend to cause lots of issues when they are copied and pasted, which may alert a poorly equipped adversary that they're handling watermarked content. In addition, entities that are aware that you're watermarking text content in this way can just take screenshots of the text, transcribe it, or strip it of all non-ASCII characters.

The best solution I've seen is something I like to call "content transposition". The idea is that you take a paragraph of content and run it through a program that will reorder parts of content and inject/manipulate indefinite articles in order to create your watermark, while keeping the content grammatically correct. That way even if an adversary is fully aware that you're watermarking text content, they need two copies of the watermarked text in order to identify and strip your watermark.
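
As a toy Python illustration (the sentence variants below are made up), the recipient ID can be read straight off which variants appear in a leaked copy:

    # Toy sketch of "content transposition" watermarking: each slot offers two
    # grammatically valid variants, and the choice at each slot encodes one bit
    # of the recipient's ID.
    SLOTS = [
        ("We will form at the staging system", "We will be forming at the staging system"),
        ("an hour before the op", "one hour before the op"),
        ("and move as a single fleet.", "and move as one fleet."),
    ]

    def watermark(recipient_id):
        return " ".join(b if (recipient_id >> i) & 1 else a
                        for i, (a, b) in enumerate(SLOTS))

    def identify(leaked):
        return sum(1 << i for i, (a, b) in enumerate(SLOTS) if b in leaked)

    print(watermark(5))            # three slots -> IDs 0-7
    print(identify(watermark(5)))  # 5

With only a handful of slots you can't distinguish many recipients, which is why real tools lean on longer documents and more substitution classes.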


Almost a decade ago I wrote a tool to search forum dumps from various EVE Online alliances. The content was acquired by spies and often watermarked.

The first barrier was that homoglyphs would inhibit text search, so I had to build an automated homoglyph detection and substitution layer.

Once homoglyphs were stripped, the challenge was then to fit the entire search corpus into memory, so I compressed each page with LZMA, loaded it into memory, and decompressed on the fly when searching—probably not optimal, but still way faster than loading from disk.
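
Roughly this, as a minimal Python sketch (not the original tool):

    import lzma

    # Keep each page LZMA-compressed in memory and decompress on the fly
    # while searching, as described above.
    class CompressedCorpus:
        def __init__(self, pages):            # pages: {name: text}
            self._pages = {name: lzma.compress(text.encode("utf-8"))
                           for name, text in pages.items()}

        def search(self, needle):
            needle = needle.lower()
            for name, blob in self._pages.items():
                text = lzma.decompress(blob).decode("utf-8")
                if needle in text.lower():
                    yield name

    corpus = CompressedCorpus({"thread-1": "Fleet forms at 19:00 EVE time."})
    print(list(corpus.search("fleet")))   # ['thread-1']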

I always wanted to try reverse engineering some of the watermarking systems so we could modify the watermarks on certain material, subtly leak it, and effectively frame adversaries while protecting our own spies in the process. Fortunately or unfortunately I never got around to that.


What kind of content would be distributed that way? Something like battle plans sent to individual members? Wouldn't they be able to just snitch the important information without copying the sensitive message word for word?


There are two kinds of text-based data that counter-intelligence ops try to protect: forum posts and 'fleet pings'.

For forum posts, an adversary can do exactly what you say - just take the meaning of the forum post and write a cable on it.

Fleet pings are how massive fleets are formed in EVE. Basically they're like an @here ping on Discord, or a server announcement on XMPP or IRC. Fleet pings are important because if you know when your enemy is pinging for fleets, you know when and where to find their fleets to fight them.

For that reason, the value of fleet pings is very time sensitive, and they can't realistically be transcribed by a spy in real time. Usually orgs will set up a "ping relay" channel for their top fleet commanders, with a bot that relays pings from enemy alliances. This is a very common interception point for obtaining watermarked pings and burning spies.


It's a TON of data. When I was in an alliance during my EVE days we had TONS of "classified" information. It would consist of systems, contents within those systems, daily numbers for corp and alliance transactions, influx of cash, spies, plans, user information, capital and super capital ship information / plans; EVE is such a huge and complex game that many joke that it's a "second job". Well, that means they also have a TON of data that can be used by other corporations / alliances.

Such an interesting game. I played for years and always want to go back.


Do you happen to have a link to these "content transposition" tools?


No, counter intelligence tools in EVE are a closely guarded commodity. The automatic ones I've seen leave a lot to be desired. There's one group that I'm nearly certain is running one that works well but I don't have evidence to back it up.

One group decided that writing an automatic transposition program was too hard, and instead manually creates transpositions for major posts. You can find the writeup here: http://failheap-challenge.com/showthread.php?16311-Taking-th...


I can't believe people still read that stuff I wrote :o


You're the only person who ever writes about exploiting people's CI tools, so I end up quoting your work a lot.

Would love to see a swing taken at Pandemic Legion's new forum background watermark, it's a lot more interesting than the old one.


We wrote automatic tools too ;)

Some were identified and others... not


I was wondering that too, because even the example of adding/deleting indefinite articles while remaining grammatical would be very difficult to automate.


That same type of "content transposition" tool would also be a way to ensure obfuscation of content source if applied to copied text.


I know there is no way I have the time or energy to play EVE but it is the most fascinating game (MMO) that I have ever seen. The writeups on it are so fascinating.


Back in the Usenet days, around the late '80s, I had read somewhere/somehow a few years before about how classified information would be printed with tiny differences in spacing to track leaks to foreign adversaries. In some newsgroup, I happened to mention I had read about this technique, which, while obvious once you think of it, was apparently not well known. Certainly by that time, it was very well known to the KGB.

Months after my public comment, I got a phone call from an AT&T inventor who was prosecuting a patent on the same technology. They were very interested in where I had read about the technique in the public literature. Alas, I could not remember where I had read that little factoid, so I wasn't much use to them. It was disturbing to them that their patent claim was out in the open literature somewhere, but they could not find it.


The general idea is called a Canary Trap, and IIRC it was popularized by Tom Clancy. The most common form is to have some subtle variations in the content or wording, which would give away the origin.

Edit: here we go: https://en.m.wikipedia.org/wiki/Canary_trap

The Canary Trap, aka The Barium Meal.


One of the alliances in Eve online used to do this type of counter intelligence to assist in identifying spies releasing internal communications. They’d also introduce invisible watermarks in any images they posted.

My understanding was their leadership had a tool to help them select a few synonyms in anything they wrote, and a few synonyms was all it took to identify the account releasing the communication.


More than one. PL, Goons, Test, various Russian alliances, and some smaller players all would. I really loved EVE's (counter)intelligence scene; it felt like a great place to safely dabble in real tools and techniques like this.


[flagged]


I spent far too long looking at your comment history trying to figure out if you really were the author (since others have shown up here) until I stumbled on something semi-verifiable and visited Wikipedia (which should have been my first stop), only to learn the author Tom Clancy passed away a few years ago. :)


You would be shocked to learn how many Twitter users don't bother doing either before writing to me.


Whitespace chars at the end of lines were used to encode formatting information; I think clari.net news services were the primary users.


Isn't that mentioned in The Cardinal of the Kremlin?



I think they mention it as being ineffective. The method mentioned in the book involves inserting inflammatory quotes into different versions of a report, in the hopes that the quote will show up somewhere.


Some printers do that; it is very expensive and mostly moot because of electronic documents.

See https://qz.com/1002927/computer-printers-have-been-quietly-e...


The parent was discussing tiny differences in word/letter spacing, not invisible dots.


Notably, some zero width characters tend to get removed, especially in systems that try to remove excess whitespace. I made a very rudimentary PoC of encoding data in zero width characters, but was hit by a few things:

- Some characters affect word wrap in unexpected ways, depending on the script of the text.

- Some characters impact glyph rendering in minor ways. For example, the ligature between f and l may be interrupted.

- Some characters are outright stripped. For example, Twitter strips U+FEFF.

- Zero width characters often trip up language detection systems. I noticed that Twitter detected my English message as Italian with the presence of some zero width characters.

So it's not necessarily as useful as it seems. If you pick specific characters to strategically avoid these issues, it's hard to make the encoding very efficient.

Still, it probably has its uses.
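
For the curious, a stripped-down Python sketch of that kind of PoC (U+FEFF deliberately avoided; the word-wrap, ligature, and language-detection caveats above still apply):

    # Encode payload bytes as a run of zero-width characters stuffed after the
    # first visible character. One bit per zero-width char, so it's not efficient.
    ZW0, ZW1 = "\u200b", "\u200c"   # zero-width space / zero-width non-joiner

    def embed(cover, payload):
        bits = "".join(f"{byte:08b}" for byte in payload)
        hidden = "".join(ZW1 if b == "1" else ZW0 for b in bits)
        return cover[0] + hidden + cover[1:]

    def extract(text):
        bits = "".join("1" if ch == ZW1 else "0"
                       for ch in text if ch in (ZW0, ZW1))
        return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

    stego = embed("We look the same.", b"id42")
    print(stego == "We look the same.")   # False, though visually identical
    print(extract(stego))                 # b'id42'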


Related: we built a homoglyph linter for Go source code, to help detect potentially malicious homoglyph substitution: https://github.com/NebulousLabs/glyphcheck

UTF-8 source code is nice for i18n, but it also opens the door to these kinds of attacks.


That’s a good start, but unless I’m misreading it[1], the range of homoglyphs it checks for is rather small. You might be better off importing the Unicode Consortium’s list of ‘confusables’[2] if you’re planning automated linting.

[1] https://github.com/NebulousLabs/glyphcheck/blob/f6483dd9e97a...

[2] http://www.unicode.org/Public/security/latest/confusables.tx...
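
A minimal Python sketch of importing confusables[2], assuming its published "source ; target ; type # comment" line format with hex code points:

    def load_confusables(path):
        # Build {confusable character: skeleton string}; multi-codepoint
        # sources, comments, and blank lines are skipped.
        table = {}
        with open(path, encoding="utf-8-sig") as f:
            for raw in f:
                line = raw.split("#", 1)[0].strip()
                if not line:
                    continue
                src, dst = [field.strip() for field in line.split(";")[:2]]
                if len(src.split()) != 1:
                    continue
                table[chr(int(src, 16))] = "".join(chr(int(cp, 16)) for cp in dst.split())
        return table

    def skeleton(text, table):
        # Fold every confusable character to its plain target before comparing.
        return "".join(table.get(ch, ch) for ch in text)

Two identifiers whose skeletons match but whose raw text differs are exactly the suspicious cases a linter wants to flag.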


Whereas, in Perl, we have a module for executable whitespace: http://www.perlmonks.org/?node_id=270023

Damian Conway is a wonderful mad genius.


I've thought about using whitespace and/or zero-width characters to embed a cryptographic signature. The goal was a browser extension that could sign the contents of an arbitrary <textarea> like an invisible "gpg --clearsign". The signature rules would have to be relaxed to accommodate common transformations servers do to user comments.

Ideally, this would allow people to cryptographically sign comments and automatically verify comments that were signed by the same author, all without changing existing server software or adding ugly "-----BEGIN PGP SIGNED MESSAGE-----" banners.


Ok, so the high-level goal is to be able to inject extra data into a message (which could include a signature). Add in a magic string, and a browser extension could then detect it and decode it (maybe adding a little badge inline). From the API perspective, you'd want a standard way to transform an arbitrary block of text, plus the data you want to attach to it, into a text+data blob that still looks and feels like text to humans; for the reverse, regex for the magic string and decode from there to the end of the message to get the original text and data out separately.

One way would be to place N zws (zero-width spaces) between each original character, and treat each block as a digit which encodes a number. This could work for large original texts, but it would be clunky and very low bandwidth I think. E.g. if "." is zws, you could encode "fox" and the number 123 as "f.o..x...".

Better, I think, would be to create an alphabet with several of the zero-width characters and put the whole encoded number somewhere where it's unlikely to get trimmed or mess up line breaks (probably near the end in the interior, but on a word boundary next to a space).

The hardest part would be making a transformation that doesn't simplify things too much but is still resilient enough to what many forums do (markdown/bbcode/trimming) that the result could be perfectly converted back into a PGP message. Maybe include some error correction?
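
A rough Python sketch of that last idea (a small zero-width "alphabet" carrying 2 bits per character, behind a magic prefix parked at the last word boundary):

    ALPHABET = ["\u200b", "\u200c", "\u200d", "\u2060"]  # ZWSP, ZWNJ, ZWJ, word joiner
    MAGIC = ALPHABET[3] * 3                              # marker the decoder scans for

    def attach(text, data):
        symbols = [ALPHABET[(byte >> shift) & 0b11]
                   for byte in data for shift in (6, 4, 2, 0)]
        cut = text.rfind(" ")
        if cut == -1:
            cut = len(text)
        return text[:cut] + MAGIC + "".join(symbols) + text[cut:]

    def detach(text):
        start = text.find(MAGIC)
        if start == -1:
            return b""
        out, acc, nbits = [], 0, 0
        for ch in text[start + len(MAGIC):]:
            if ch not in ALPHABET:
                break                      # hit visible text again
            acc, nbits = (acc << 2) | ALPHABET.index(ch), nbits + 2
            if nbits == 8:
                out.append(acc)
                acc, nbits = 0, 0
        return bytes(out)

    marked = attach("Fleet forms at 19:00.", b"\x2a")
    print(detach(marked))   # b'*'

In practice the payload would be a detached signature rather than a single byte, and the resilience/error-correction question is still the open part.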


This looks like an amazing idea, why didn't you proceed further? I would love to see something like it, that could help certify messages :)


Amazing idea, but very ugly per se. The thought of mixing "human semantics" with a signature in the same text sends shivers down my spine.


It's pretty similar conceptually to how we sign email, except that the signature would be invisible to humans.

Of course nobody ever said how we sign emails is beautiful.


This isn't as new as the author thinks. Doesn't have to be new to be interesting, of course :)

I know this was done at one large tech company around 2010 for an internal announcement email. Different people got copies with slightly different Unicode whitespace, despite the email having ostensibly gone directly to an "all employees" alias.

The fingerprinting was noticed within half an hour or so. (Somebody pasted a sentence to an internal IRC for discussion, and as is the case with Unicode and IRC, it inevitably showed up as being garbled in exciting ways for some people).


Very good problem description and nice list of countermeasures!

However, the following countermeasure made me wonder:

> Manually retype excerpts to avoid invisible characters and homoglyphs.

Isn't this something you can automate? Shouldn't we create linters for plain text (rather than code)? For example, depending on the language, reduce the text to a certain set of characters. Every character not in this whitelist is either replaced or causes an error message the user (journalist) needs to deal with (i.e. remove it, or replace it with an innocent alternative, perhaps even proposing this replacement back to the linter project).

Of course, there are multiple ways of linting, which might become a fingerprint on its own. But then, if there are only 3 or 4 such linter styles actually in use (ideally, standardize on exactly one linting style), you can only tell which linter was used by the journalist, without any information about their source.
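
A minimal Python sketch of such a linter (the whitelist is illustrative; a real one would be per-language):

    import sys
    import unicodedata

    # Report anything outside a small whitelist with its code point and name,
    # so the journalist can remove or replace it deliberately.
    ALLOWED = set(
        "abcdefghijklmnopqrstuvwxyz"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "0123456789"
        " .,;:!?'\"()[]-_/"
    )

    def lint(text):
        for lineno, line in enumerate(text.splitlines(), 1):
            for col, ch in enumerate(line, 1):
                if ch not in ALLOWED:
                    name = unicodedata.name(ch, "UNKNOWN")
                    print(f"line {lineno}, col {col}: U+{ord(ch):04X} {name}")

    lint(sys.stdin.read())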


Hi, I'm the author.

I was kinda on the fence, but I was considering my target: Journalists. A journalist may not notice simple differences like an extra space here or there when reading, but probably wouldn't retype a double space. I agree that this should be automated in some way, but it's a bit of an arms race.


What kind of arms do you envision for normalizing text down to Latin-1, or even ASCII, while normalizing any whitespace to single space characters?

There is a known variant where every subscriber of a confidential text receives a slightly different copy with the same meaning. But it's much harder to implement, and it does not scale.


Not that much harder to implement. You could automate that too with some basic text substitution, e.g. replacing various instances of "and" with an ampersand or a plus sign. You could also vary joined words like "without" / "with out", and alternate between the types of quotation marks used (as there are several in Unicode).

And this is without breaking into more intelligent heuristics where you swap out synonyms ("more intelligent" because you'd need to be careful not to alter passages that need to be kept verbatim, like quoted text, or where a synonym might alter the context of the sentence. But with a little care I think that is achievable as well).
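
Toy Python sketch of that first substitution class (only "and" vs "&" shown; it assumes the text has at least eight occurrences to carry a full 8-bit ID):

    import re

    # Each occurrence of "and" is either left alone or swapped for "&";
    # the per-occurrence choice spells out bits of the recipient ID, MSB first.
    def mark(text, recipient_id):
        bits = iter(f"{recipient_id:08b}")
        return re.sub(r"\band\b",
                      lambda m: "&" if next(bits, "0") == "1" else m.group(0),
                      text)

    def read_mark(text):
        bits = ["1" if tok == "&" else "0" for tok in re.findall(r"\band\b|&", text)]
        return int("".join((bits + ["0"] * 8)[:8]), 2)

    original = ("Ships and fuel and ammo and isk and time "
                "and pilots and plans and intel and luck.")
    print(read_mark(mark(original, 178)))   # 178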


I'm curious as to why you think a linter tool for sanitizing the text is less effective than the other measures you describe. As long as the target problem is zero-width characters, this seems like it should be more effective than all but #1. I would agree if you include the synonym or other lexical fingerprints.


You might be interested in reading my update:

https://www.zachaysan.com/writing/2018-01-01-fingerprinting-...

Humans make fixes that are hard to codify in programming. I know it isn't perfect, but with my audience in mind (journalists) I thought it was probably safer than something automatic.


Convert to image and run it through an OCR? Needlessly complicated, but it could possibly mimic the retyping part.


Why not just a regex that matches on ASCII characters and removes the rest?
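
E.g. something like this Python one-liner (a sketch; it simply drops anything outside printable ASCII):

    import re

    # Strip everything outside printable ASCII (plus newline and tab).
    def ascii_only(text):
        return re.sub(r"[^\x20-\x7E\n\t]", "", text)

    print(ascii_only("We're\u200b not the same\u200b text."))
    # -> "We're not the same text."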


I think the general problem with any automated solution is that there is so much room to game them. For instance, I could selectively replace a few visible ASCII characters with non-ASCII look-alikes. Then the investigators just need to see which characters are missing. Even with the OCR option you could selectively add typos.


How about back and forth through a translation app or two?


Because you still probably want accented characters and other unicode elements, I think.


Why? Outwith non-Latin languages, accents and oddities like the sharp S don't actually change the meaning.


Using RegEx is generally frowned upon; RegEx is a bad language to write anything in apart from prototypes. Furthermore, this will not work for most text, as even US English text contains special characters. Think quotes or paraphrases from other languages, names, and imported words.


Frowned upon by who? 'Bad' in what way?

Regular expressions are powerful, and are used in plenty of production software.


I think people dislike [Regex] for the some of the same reasons The Principle or Least Power makes sense.

If you can do the same manipulation in three ones of code it’s more likely to be correct and stay correct. And when you look at it again in six months you won’t have to stare at it. All those little time sucks add up as the code grows.

Edit: autocorrect got me twice.


I'd say (as I often do), 'it depends'.

Firstly, for relatively simple regex expressions, any competent developer should be able to grok them very quickly - at least as quickly as the equivalent C#/Java/whatever code.

Secondly, it may be that regex is the most performant solution, and sometimes that matters quite a bit.

Honestly, I just don't get why some people are intimidated by regex.


Regex is seldom the most performant code. I mean, the engines themselves are fantastic pieces of engineering, but in tight loops I've found I can get - sometimes significant - performance improvements by replacing regex matches or substitutions with purpose-written string manipulation. Obviously the results depend massively on several big variables:

1/ the regex engine

2/ host language

3/ problem you're trying to solve

But I've found generally I was better off not using regex for performance critical code.

HOWEVER (!!!) where regex consistently wins is development time. Not just writing the code, but testing (it's trivially easy to test regex) and updating the pattern matching (vs updating the equivalent character matching in an imperative language).

Yeah regex can get ugly quickly, but then so can any language if misused.


Actually thrice: lines -> ones.


* RegEx opens your application up for DoS attacks

* RegEx is not very readable

* RegEx can be (very) slow

* It's not trivial to write RegEx code that achieves your goal in a high-quality way. Often quirks and edge cases are missed.

I'm not saying that you should never use them, but oftentimes a (much) better alternative for achieving your goals is available.

See https://blog.codinghorror.com/regular-expressions-now-you-ha...


As the article you link argues, use of regular expressions when they are inappropriate is bad. This particular case - finding and replacing certain characters with other characters - is pretty well-suited to the problem, and is probably more readable than a bunch of open code to do the same thing.

(I'm not sure what you mean by DoS attacks - are you referring to the exponential case of backtracking? If so, don't use a regex engine with that problem, and don't use lookbehind/lookahead assertions, which aren't needed to solve this problem.)


> This particular case - finding and replacing certain characters with other characters - is pretty well-suited to the problem, and is probably more readable than a bunch of open code to do the same thing.

No, it is not a good solution to the problem; you're ignoring my earlier comments. English or Latin text does not consist solely of the ASCII character set; it contains characters outside this set (quotes from other languages, names, and imported words, for example).


Good thing most regex engines handle unicode ;)

Honestly, I do get your point about inappropriate use of regex, but this kind of simple text manipulation is well suited for regex. The biggest argument against using regex for this kind of problem is performance versus writing the same code programmatically in the host language (assuming you're using a fast AOT-compiled language). However, even that is a non-issue given the small quantities of text you're decoding.

Also I'd bet the regex in this instance would actually work out more readable because the transformations are basic so you're localising the text manipulation to simple rules rather than multiple lines of byte array reading and thus also potentially having to manually build in your own rudimentary unicode support too.


How is that different from his option 5?

I suspect he put it low on the list because it could be a cat and mouse game of trying to anticipate all the potential information leaks.


Cutting and pasting using Ctrl-Shift-V in Libre Office does the trick. (Then select unformatted.) You still have to manually eliminate the spaces.


This has been a generally known technique for identifying leaks for at least three decades (and certainly much longer---Tom Clancy described it in a novel in 1987). The use of non-printing characters is an obvious extension of the idea. Any journalist who has published copy/pasted material from a confidential source since then is provably incompetent. https://en.wikipedia.org/wiki/Canary_trap


In case it wasn't obvious, that was at least partly aimed at The Intercept: http://heavy.com/news/2017/06/the-intercept-reality-winner-s...


The danger with using non-printable characters is that they may not survive a document conversion. If someone faxes and then OCRs the document your watermark is lost. Someone copies and pastes your document into a dumb text editor with an ASCII font and suddenly your characters show up as very obvious garbage, etc...


This reminds me of a time not too long ago when I was teaching programming courses; at least one of the students would somehow manage to get one of these or other weird characters into source code, resulting in much confusion. On the bright side, I would take advantage of the opportunity, and an impromptu lesson in data representation and character encoding soon followed.

It's also one of the things where a hex editor is extremely useful --- even if you're not working on low-level, seeing the bits directly can be a great confirmation of correctness.


I had this happen to me once (and only once, I learned my lesson). Our professor gave us a PDF with the problem description, and it had a bit of code in it we were to put into our final program. Well, when I copied and pasted it, some of the spaces came across as non-ASCII spaces.


Oh god pasting code from a PDF, that brings me back.

Doesn't your editor handle that? I don't exactly remember but I kinda remember having a button to convert pasted characters to ascii (mostly used for those annoying stylized unicode quotes)


Apparently Atom didn't/doesn't have that feature, at least to automatically display when problematic characters are present. I believe I ended up taking a hex editor to my code to see what was wrong, since it was only a single loop that appeared to stop the compilation.


Same problem if you have the honor of copypasting "smart quotes" from .doc or .html


Also keep an eye out for the Greek question mark, looks identical to a semicolon.


Fingerprinting data to find when it's been copy-pasted is a neat application for invisible characters!

I've also found a lot of identical-looking characters when handling Chinese text. Note that Google Translate does not handle these correctly.

https://github.com/pingtype/pingtype.github.io/blob/master/r...

It's about the Kangxi Radicals Unicode block, compared to the CJK characters block. If you want me to write a blog post about it, please comment and I'll get around to it.


It's interesting to me, because I've seen this effect (while copy pasting) and wondered why... if I hadn't been translating I would never have noticed.

Even though some are noticeably different:

⿌ 黾


Thanks for the encouragement - I'll write that blog post when I get a chance!

黾 is the simplified version of ⿌.

(In http://pingtype.github.io click Advanced > Regional, paste into the Simplified text box, then click "Simplified to Traditional")


I really like the fact that ⼚/⺁ and ⽰/⺬ and so on are separately encoded in a single block (technically, though, they aren't).


A fun fact is that the thumbprint text box in the certificate viewer in Windows starts with an invisible character.

So if you have a cert and just want to copy and paste the thumbprint into some file or application that needs to load it, then copying the full thumbprint probably won't work.

When I said fun I meant frustrating.


At one job I had to declare a moratorium on sending certain bits of information through Outlook because the number of people who needed help cutting and pasting it correctly was becoming a problem. Everything went into the wiki or config files that were checked in or used as attachments.

And don’t get me started on Microsoft and their fucking smart quotes...


The "smart" in "smart quotes" mean that they hurt, not that they're intelligent.


This. I used to maintain a software project that consisted of a few inter-communicating services on clients' windows machines (not just servers, but that would have been optimal). The most difficult part of making a sale was getting the implementation guys to correctly install these components, issue a self-signed key from the machine's local CA, and bind it to the local dns/ssl port. Not to mention most people don't even really understand how/why certificates work, so if they ran into the tiniest snag it was going to completely block progress until a developer could take a look. Barf.

Working with certificates on Windows in general is error-prone and difficult to automate (this coming from someone who spent more than a decade developing in .NET).


I've dealt with zero width characters on Windows causing problems in a variety of situations; all appear to have originated by copying something from SharePoint or Lync / Skype4business. It was included in a SQL query, causing it to fail to parse. Another time, one ended up somehow getting into a database field which then created filesystem paths containing the character, which was a much trickier one to figure out.


At one of my jobs we used SharePoint for documentation and I would constantly have issues with it inserting zero width characters, particularly when pasting into a terminal. I finally got tired of it and wrote a small Chrome plugin that copies text without those characters.


I hate this so fucking much.


Hi everyone, thanks for the feedback! I'm not sure if this deserves its own submission or not, but I have a short update to this post available here:

https://www.zachaysan.com/writing/2018-01-01-fingerprinting-...

It includes one very interesting comment from an editor of The Weekly Standard.


Elon Musk did this nearly a decade ago at Tesla to try to identify a source of leaks. He sent many copies of a memo with slightly different versions to key team members.

It backfired when one of the top executives forwarded his own copy to the rest of the team.

https://www.cbsnews.com/news/should-management-spy-on-employ...


How did it backfire? He forwarded his copy... and what?


You wouldn't know who specifically leaked, just that it was either the exec or someone who received the forward.


It also revealed the existence of the trick, since people saw their copies were different.


I use LibreOffice and these appear as grey characters (at least some of the characters do). I've wondered what the grey characters were in the past and now I know.

Copy and paste the examples into LibreOffice and you will immediately see them.


For what it's worth, there are genuine use-cases for these. My friends and I use them for our IRC bots so they can mention people's names without notifying them.


I sometimes use invisible white spaces to sort entries in lists where I don't have control over the comparator.


That's quite the interesting use.


This is so sneaky! I am trying to find a good black list and found this:

http://kb.mozillazine.org/Network.IDN.blacklist_chars

Or maybe black listing is not the best approach, maybe a mix of multiple approaches. First strip out stuff and then view the text in a program that displays "unconventional" characters? As a test I pasted the post's test sentences in Vim and the invisible characters are replaced by blocks of <XXXX> that are very hard not to notice. The more you think about it the more tricky corner cases you find :O


The existence of homoglyphs in Unicode is a failure of Unicode's mission. Two sequences of characters that render the same should be the same. Encoding invisible semantic information into Unicode is a huge mistake.


I disagree; if anything they didn't go far enough. Unifying Chinese and Japanese characters, making them locale dependent, was, IMO, a mistake. The kind of problem I have with this can be seen with the so-called "Turkish I" problem. The Turkish language has 4 'i's: i, I, ı and İ. Unicode decided to encode the first two using the code points for Latin lower case i and upper case I. In Turkish, the capitalization rules say that i and ı are lower case, and their upper case counterparts are İ and I. You can see that if you have a byte stream for which you don't know the locale, you cannot correctly apply capitalization to it. This is not as trivial a problem as it sounds[1]. It could have been avoided if Unicode had a separate Turkish i and I; that way all 'I's would be unambiguous in all contexts. You can extrapolate this issue to entire languages.

[1] https://gizmodo.com/382026/a-cellphones-missing-dot-kills-tw...
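
To make the problem concrete, here's a Python sketch; str.upper() is locale-independent, so it gives the wrong answer for Turkish (the special-cased version assumes lowercase Turkish input):

    word = "istanbul"
    print(word.upper())   # 'ISTANBUL' - Turkish wants 'İSTANBUL'
    print("I".lower())    # 'i'        - Turkish wants dotless 'ı'

    # A correct Turkish uppercase has to special-case i/ı explicitly:
    def turkish_upper(s):
        return s.replace("i", "İ").replace("ı", "I").upper()

    print(turkish_upper("istanbul"))   # 'İSTANBUL'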


This is beyond the scope of Unicode. Unicode should be how text is displayed. If there is meaning that is not in the display of the character, then it is outside the scope of Unicode.

After all, we have many uses for the letter 'a', such as a) a*b=c and b) a as in apple. Should those 'a's have different Unicode code points?

Semantic meaning comes from context, and there is no context for a Unicode code point. Trying to insert semantic meaning is both a mistake and a patently impossible task. The article points out some of the wreckage attempting to do this causes.


> Should those 'a's have different Unicode code points?

Unicode does have separate code points for mathematical symbols. See for instance U+1D44E MATHEMATICAL ITALIC SMALL A. They see little use in practice though, apart from people using them for funky fonts on Reddit and such.


They are used by OpenType math fonts and systems that use them (e.g. the new equation editor in Word from 2007 onwards, and Unicode-capable TeX systems with the `unicode-math` package).


I didn't know that, thanks for the info. But the use of a) and b) remains, as does ascii art, [a] for footnotes, it's just endless.


> But the use of a) and b) remains, as does ascii art, [a] for footnotes, it's just endless.

You mean like ⒜ or ⓐ or ᵃ? Not to be confused with any of ªa𝐚𝑎𝒂𝒶𝓪𝔞𝕒𝖆𝖺𝗮𝘢𝙖𝚊ₐ. Unicode has a whole lot of stuff that nobody uses…

There's also a separate Cyrillic а.


I’ve seen all of these used quite extensively. It depends on what kind of fields you are familiar with.


So thanks to Unicode, things that look the same can have different representations, and things that look different can have the same one. (Limited) text used to be easy, now it's even worse than recording pen-strokes in a vector graphics format. How is this still useful?


It doesn't even work for the Turkish i. One cannot proofread the text to see if the right i is used - nobody is going to check the code point numbers.


That's where defaults become important. When switching to a Turkish locale the system would try to use the Turkish version for the characters, but of course this would make it harder for Turkish speakers to switch back and forth with other languages...

I guess we've always been in the days of "worse is better", like using CSV even though ASCII encodes characters specific to record separation[1].

[1]: https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text


How can it be "a failure of Unicode's mission" when the mission does not really include that? The foremost reason that Unicode exists is the unification of tons of existing legacy character sets and encodings, not a technologically perfect system. There are enough duplicates and invisible characters in a single character set (yes, some have duplicates by accident or design), across multiple character sets (so you'd like to map Latin A and Cyrillic А to the same character while retaining the collation order for each case...), or simply necessary for multilingual systems, to jeopardize any attempt to "perfect" it.

Granted, there are several failed experiments (e.g. interlinear annotations which are completely obsoleted by markup languages) and several pain points (e.g. shrug Emoji) inside Unicode. I don't really like them, you may not like them, but how would you make such a system without all these interim works?


Glyphs that look identical should have the same code point. It's as simple as that.


Then use an image, not a text. (Or use a font-mapped text which is not externally exposed by itself. Seemingly similar to PDF, but alas, PDF supports textual selection ;-) If you cannot afford that you'd better acknowledge the complexity of multilingual systems.


So what letters should be used to name the USSR? SSSR or (Cyrillic) CCCP?

Should the Greek alphabet be removed? Why have delta when d is just as good?


The greek alphabet has different glyphs.

What I'm talking about is when the glyphs are identical, i.e. homoglyphs. Homoglyphs should be removed.


So which of these are Latin and which are Greek ΑΒΕΗΚΜΝΟΡΤΧΥΖ, ABEHKMNOPTXYZ?


I've been wondering whether it could be possible to create a formal language through which one is capable of expressing ideas and facts about the world around us, rather than values and variables as is the case for programming languages. For humorous effect people sometimes write things like if ( !food ) goToStore(); -- could it be possible to formalize such constructs? If the language is formal, then the author's output can be run through a post-processor that re-formulates the expression such that superfluous data, if any, is removed to make stylographic characteristics disappear, akin to, and maybe using the same foundations as, the reduction of mathematical equations into a bare minimal set of symbols. Furthermore, a mathematical approach to reasoning about real-world concepts is interesting. Computational philosophy?

Tangentially related to this topic (ulterior fingerprinting), I wonder whether websites like Twitter might be encoding your IP address or account ID in pixels on your screen (so subtle that it's impossible to discern with the naked eye) to make it easy to track screenshots back to you.


I just copied and pasted the example text in the OSX Text Editor app, and all the zero-width characters got removed.


Short anecdote:

A few years ago I had to fill in some translation keys within an ecommerce shop GUI. Eventually, I had to revert some key to its default value and therefore tried to copy and paste the displayed default value, but the system always refused to take that value. It complained that the string contained invalid characters, but I was puzzled because I had just inserted the previously valid default value, and to my eye there were no special characters anyway?!?

So I called one of the devs and after a few minutes he told me, that the output I copied contained a zero-width space and that character was not allowed by the validation engine. So when I typed the string myself everything went fine ;-)

Nowadays, I like to consult `hexdump -C` in such cases.


Zero width non-joiners are also a great way to bypass swearing filters in most games and forums.


Yes, I've noticed this very often. Also people who use RTL override and type text backwards.


I made a jsfiddle so you can see if there are funny characters in text like the

""" We're​ not the​ same text, even though we look the same.

We're not the same​ text, even though we look the same. """ example by cutting and pasting it:

https://jsfiddle.net/tim333/bjL018k1/

(It took me a lot of googling to figure out how to handle unicode in javascript.)

Slightly to my surprise the zero width spaces survived being posted into this comment.


> it appears both homoglyph substitution and zero-width fingerprinting have been discovered by others

They were discovered a long time ago, and there are many other ways to hide data in documents.

Remember that you only need to encode enough bits for a relatively unique ID (and not unique for all files in the universe but only for files with the same content - for a low-distribution file, even 2-5 bits might be enough). On the application level, the most common applications and formats have a very large number of features you can utilize to encode data or simply insert it (e.g., Word, Excel, the PDF standard). On the bit level, unless the application vendor has invested in writing exceptionally tight, secure code, you probably can find someplace to hide/encode a few bits in a file.

But I think the author is on the right track with the solutions ...

> Use a tool that strips non-whitelisted characters from text before sharing it with others.

A more general solution is needed: Something that normalizes data in many formats, from text to Word to PDF to JPG to WAV to markup languages.

Personally, for non-security reasons, I'd love a utility that normalizes text to 7-bit ASCII (e.g., from UTF-8 characters higher than 7 bits) and that fits very efficiently into workflow (e.g., something that normalizes text in the clipboard if I press a hotkey combination). Anybody know of one?
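
Rough sketch of the clipboard normalizer I have in mind, wired to a hotkey (macOS pbpaste/pbcopy assumed; NFKD plus an ASCII encode drops anything it can't decompose):

    import subprocess
    import unicodedata

    # Read the clipboard, fold to 7-bit ASCII, collapse whitespace runs
    # (which also flattens newlines - adjust to taste), write it back.
    text = subprocess.run(["pbpaste"], capture_output=True, text=True).stdout
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = " ".join(text.split())
    subprocess.run(["pbcopy"], input=text, text=True)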


How might this affect accessibility (e.g., screen readers for the visually impaired)?


Microsoft Word still has the correct word count.

Using Option-Right Arrow to move through the text a word at a time has a problem though: the cursor appears to get stuck at that point, and MS Word shows the font changing.


I tested with AppleScript's "say" command, and it doesn't pronounce spaces, so screen readers are probably fine.

It might affect string handling where you split a string by space characters, and then compare words with a dictionary.

For example, Pingtype English which translates word-by-word to Chinese. (note the words "the" and "same" not getting translated in the examples).

https://pingtype.github.io/english.html

Google Translate handles it fine.


This was a common thing when the first word processors came out. They called it 'micro-kerning' and it involved slight adjustment of the letter and word spacing. The fit of characters is called 'kerning' and it increases readability - and it also made it possible to create pages in which the copier's ID could be found. The document source would create a novel document for each person, and if they leaked it and it was reproduced in a paper, the character spacings could lead to the leaker. Synonym substitution was also used in the same way. Most spy types would sanitize their docs with a scanner and paraphrase many of the words.


So please normalize your texts: unorm --help

https://crashcourse.housegordon.org/coreutils-multibyte-supp...


This blog post (from colleagues) covers the ideas in a little more detail http://blog.fastforwardlabs.com/2017/06/23/fingerprinting-do... and comes with proof-of-concept code https://github.com/fastforwardlabs/steganos.


Since I'm looking at what I assume are ASCII characters, I just typed this into the console:

"We're not the same​ text, even though we look the same.".split("").forEach(function(c, i){ if(c.charCodeAt(0) > 127) {console.log("Danger: weird character detected: code " + c.charCodeAt(0) + " at index " + i);} });


For languages other than English (or even for English with some accented characters), you'll need to do a bit better than just spotting non-ASCII characters. Finding zero-width/not fully normalized characters would be a bit more robust, but still not perfect. IIRC, the zero-width joiner/non-joiner are important for rendering various Indian languages correctly. But I guess it depends on your use case, and how many false positives you can tolerate.
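
A narrower check than "non-ASCII" is to flag Unicode format characters (general category Cf, which covers ZWSP/ZWNJ/ZWJ/BOM) and treat hits as warnings rather than certainties, since ZWJ/ZWNJ are legitimately needed for some scripts. A Python sketch:

    import unicodedata

    def suspicious_chars(text):
        # Yield position, code point, and name of invisible format characters.
        for i, ch in enumerate(text):
            if unicodedata.category(ch) == "Cf":
                yield i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")

    sample = "We're\u200b not the same text."
    for pos, code, name in suspicious_chars(sample):
        print(pos, code, name)   # 5 U+200B ZERO WIDTH SPACE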


What I'd want for most cases is to "paste as printable ASCII." That covers a large number of use cases (the article's example is printable ASCII ostensibly to reach the largest audience).

I'm not sure what to do with more complicated language mixing, but I suppose a script to tell you what languages are being used and where there might be "out-of-place" character codes would help.


Solving this seems pretty similar to how browsers handle phishing related issues with International Domain Names.

https://en.wikipedia.org/wiki/IDN_homograph_attack


Recently I played around with zero-width chars to test some inputs. It's awesome because big payloads can be just one innocuous line. The downside is that navigating through the text with arrow keys also steps through the invisible chars, so the cursor appears to be stuck in one place for a few presses.


If you use Visual Studio Code or Sublime Text, there are extensions to highlight characters like that:

https://marketplace.visualstudio.com/items?itemName=nhoizey....


As an add-on, a lot of characters can be represented in different ways. For instance there is the precomposed character ä and then an 'a' followed by a combining diaeresis (the two dots). For the "usual" cases a normalization solves this, though.


This is hard to detect for the average user. Even for someone seasoned, like me.

I dare you to inspect the html in Safari, Chrome, Firefox, and even curl, to see if you can view the raw character encoding.

Some you can, some you can't.


This is why I still swear by Lynx for most browsing


The two different sentences do not look different for me in Lynx. I guess Lynx just passes the raw characters on and any Unicode handling happens in the terminal.


I should have said, "Lynx and a fairly dumb terminal"


Wouldn't any sensible journalist round-trip any documents through 7-bit ASCII to strip any possible watermarking?


Most journalists don't have a technical background and don't know how Unicode works. Plus, round-tripping through 7-bit ASCII is lossy for characters you may want to keep (accented loanwords / names, non-English text) and doesn't prevent all the attacks in the article (providing text with slightly different spellings / word orders / etc). 7-bit ASCII also has invisible control characters of its own in the 0x00 - 0x1F range...


an American journalist handling English text, maybe.


7b ASCII is so passé that “roundtripping” through it would be very naïve.


So good opsec requires some trade-offs, just as, say, MAC has some disadvantages compared to DAC.


anyone else bothered by the (mis-) use of "fingerprint" instead of "watermark"?


The difference being that “fingerprint” means deriving a signature from some existing data, and “watermark” means altering data to make it identifiable?


Mind blown, so simple now that you explained it but boy did it never occur to me. Thanks!


Interesting: when trying to access the above site, I get a security warning that the site is unsafe.


Also useful for pretending chat-bots are broken.


One interesting trick I found for clearing weird characters from the clipboard is running `pbpaste | pbcopy` in the (macOS) terminal.


Just curious why that would do anything? Isn't it the same byte stream coming out that then goes into the other one?


ASCII-friendly alternatives for invisible fingerprinting:

- replace quotation marks with two apostrophes

- replace “I” with “l”

- replace parentheses with slashes or square brackets

- replace commas with semicolons

Each of these substitutions can communicate a single bit while being ASCII-safe and, unlike synonyms, likely won't change word wrapping.


All of those except the last one are pretty obvious, especially if you're mixing them to "communicate a single bit".

Speaking of quotation marks, the non-ASCII quotes in your second item and apostrophe in the last sentence stand out too.


They are obvious if you have two copies and are actively looking for differences, otherwise not. If someone sent me a document with square brackets somewhere, I wouldn't question their use of square brackets over parentheses, unless they don't match.


There are many more options: Insert one of the many non-printing ASCII codes. Add an extra space, especially before a newline. Add or remove an extra newline at the bottom. Change punctuation where the choice is believable either way: Single quotes for double, commas for semicolons, periods for colons.



