In Russia there is a government procurement portal, where government organizations have to post their requests in order to enforce competition and get the best prices.
The usual tactic [1] of corrupt officials was to replace Cyrillic (Russian) letters with their Latin homoglyphs, so that only affiliated companies could find and win the contract.
[1] http://www.bbc.com/russian/rolling_news/2013/04/130409_rn_st...
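A minimal sketch of why the trick works (the tender title here is made up): the doctored string renders identically to the genuine one, but a plain-text search for the real Cyrillic word no longer matches.

    genuine = "РЕМОНТ"                        # "repair", all Cyrillic
    doctored = "РЕМ\u004fНТ"                  # Latin capital O (U+004F) swapped in for Cyrillic О (U+041E)
    print(genuine == doctored)                # False
    print(genuine in "Тендер: " + doctored)   # False: searching for the real word finds nothing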
Kind of surprised at how poorly Google handles this (I would have expected at least a correction suggestion)! Heck, it might open the door for an obscure blackhat/phishing technique...
Not Google, but apparently one way some malware tries to hide edits to, say, the hosts file is to create a duplicate hosts file with the Cyrillic homoglyph for 'o' and then hide the real hosts file.
Presumably this would trick users who would go check "C:\windows\system32\drivers\etc\" but not show hidden files. Seems like a niche subset, but still a neat trick.
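If you suspect that trick, a few lines of Python can surface it; a rough sketch (the directory path is the usual Windows location, adjust as needed):

    import os

    # Flag file names containing non-ASCII characters, e.g. a fake "hоsts"
    # whose 'о' is actually Cyrillic U+043E.
    for name in os.listdir(r"C:\Windows\System32\drivers\etc"):
        odd = [hex(ord(c)) for c in name if ord(c) > 127]
        if odd:
            print(name, odd)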
No, you're mistaken. It is actually a very big problem. Earlier on the same page you linked to, it explains that "ICANN approved the Internationalized domain name system, which maps Unicode strings used in application user interfaces"[1].
As a concrete example, the following are fake links to Wikipedia (and entirely equivalent):
It is true that network protocols encode these internationalized domain names in a subset of ASCII, but the user sees Unicode in his browser address bar or email. There is no restriction on how applications (like browsers) display domain names[2]; they can use Unicode if they want. This leads to all sorts of devious attacks[3].
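You can see the ASCII wire format for yourself; Python's built-in "idna" codec (IDNA 2003) does the Punycode conversion. Here the 'а' in the domain is Cyrillic U+0430:

    fake = "wikipedi\u0430.org"       # renders as wikipediа.org
    print(fake.encode("idna"))        # b'xn--wikipedi-86g.org': what actually goes over the wire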
Maybe some sort of extortion scheme? Send an email to a small business person that isn't very technically savvy, say you have just erased all the search results for their business from Google, provide link, demand a Bitcoin to return the results.
Maybe a low hit rate, but if you could automate it, you could run the scam on a lot of places.
So, one might wonder why these homographs have different code points. After all, the French A and the English A share the same code point.
It's really difficult to do the right thing here. If the Greek question mark shared a code point with the semicolon, it would obstruct search-and-replace for question marks.
Subtle differences in how Japanese and Chinese are written have led to differently written characters sharing the same code point. It's nice that you can easily look up most Japanese characters in a Chinese dictionary and see how they are used in China, but it has become frustratingly hard to get the subtleties of their written forms right. The Chinese version may have one line striking through another, while in the Japanese version they only touch.
I honestly don't know how to go about showing how the same code points have different written forms!
But it seems like it would be nice if code editors warned about text outside ASCII. You usually only want that in strings and comments.
> It's really difficult to do the right thing here.
There is a fairly good solution to this for things like code editors and URLs or search strings in browsers: If a string contains a non-ASCII code point that is a homograph for an ASCII character, swap the text and background colors.
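A minimal sketch of the detection half, assuming a hand-made lookalike table (a real implementation would use the full Unicode confusables data):

    import unicodedata

    # Illustrative subset of non-ASCII code points that mimic ASCII glyphs.
    HOMOGRAPHS = {"\u0391": "A", "\u0397": "H", "\u041e": "O",
                  "\u0430": "a", "\u037e": ";"}

    def flag(line):
        for i, c in enumerate(line):
            if c in HOMOGRAPHS:
                print(f"col {i}: U+{ord(c):04X} {unicodedata.name(c)} "
                      f"looks like {HOMOGRAPHS[c]!r}")

    flag("return x\u037e")   # col 8: U+037E GREEK QUESTION MARK looks like ';'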
The reason for a lot of homographs with ASCII is that there are old code pages that have entire non-US alphabets and punctuation sets in the 128-255 range, and regular ASCII in the 0-127 range.
It's a design goal of Unicode to support exact round-trip transformation with any code page ever in use in the real world, so they can't unify two characters that appear at different code points in a Greek code page without breaking things, even if they're always graphically indistinguishable in a font.
> It's really difficult to do the right thing here. If the Greek question mark shared a code point with the semicolon, it would obstruct search-and-replace for question marks.
Context is the key here. Greek text doesn't use the semicolon for other purposes and searching/replacing such single characters in source code is a terrible idea anyway (think comments, string literals...). So what is the prohibitive failure scenario here?
Indistinguishable (for humans) characters with different code points were a stupid idea, it's fine to abuse it in order to point out that fact.
They're not necessarily indistinguishable. Line-break rules, directionality and typographical conventions (height/width/alternate glyphs) may differ between apparent homographs. And using the same code point would make distinguishing between, say, Latin and Cyrillic searches difficult. How would Google tell if you mean CCCP or СССР?
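Indeed, the two render alike but share no code points, which is exactly what lets a search engine keep them apart:

    print([hex(ord(c)) for c in "CCCP"])   # ['0x43', '0x43', '0x43', '0x50'] (Latin)
    print([hex(ord(c)) for c in "СССР"])   # ['0x421', '0x421', '0x421', '0x420'] (Cyrillic)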
I don't know enough about Greek to comment on it deeply. Some people do consider using the same code point for apostrophes and quotation marks problematic (it's definitely annoying when doing word segmentation). We also have breaking and non-breaking spaces, as well as tabs. And we luckily didn't follow typewriter conventions and collapse several alphabetic characters!
A semicolon is not considered sentence-final, whereas a question mark usually is. This makes it easier for software to auto-capitalize. So that's at least a possible use case.
There's also the possibility that Greek requires the top dot to be square or circle, meaning it might in fact have subtle differences in print.
> Some people do consider using the same code point for apostrophes and quotation marks problematic (it's definitely annoying when doing word segmentation)
And those people are a menace. "Ball bearings" is a single word with a space in the middle. The only way you're going to get reliable word segmentation is with a natural language parser and a lexicon with an entry for "ball bearing". At that point, you're already recognizing "aren't" as a word; the punctuation isn't really relevant.
On the other hand, if you're not particularly upset about messing up space-including words, there's no real reason to be upset about apostrophe-including ones either.
> A semicolon is not considered sentence-final, whereas a question mark usually is. This makes it easier for software to auto-capitalize. So that's at least a possible use case.
Such software will need and have a language setting anyway (e.g. for hyphenation). It doesn't have to and cannot rely on code points alone, so the characters (or rather, different uses of the semicolon) needn't have different codepoints.
There is no "French A" and no "English A", I believe both languages call their alphabet the Roman (or Latin) alphabet. So those really are the same letter.
The Chinese could do that, but they'd be wrong. ‘French A’ and ‘English A’ have undiverged continuity with ‘Roman A’. Chinese writing is neither descended from Roman nor an alphabet.
The problem is compounded by the fact that in Chinese, there are "traditional" and "simplified" versions of many characters (with the Japanese character usually, but not always, being the same as the traditional Chinese one). And let's not even get into the different writing styles. The distinction between different characters and different ways of writing the same character is not always clear.
I had to zoom in to my screen just to see the difference between those.
I know that I am perhaps not well informed on this issue, but I also think that having multiple code points for what most people would perceive as the "same" character is not a good idea.
They are not perceived as the same by most people in those cultures. There are tons of characters where someone whose eyes are not used to spotting the difference would not notice that they are actually different. A single dot, a stroke touching versus passing through, or a tip curved versus straight makes all the difference:
入 vs 人
玉 vs 王
口 vs ロ <- my font might make those indistinguishable out of context
In English, even within the same letter there are variations on how it can look. For instance, the letter a can look different based on serif/sans-serif, double-story/single-story, or even cursive and italic. There isn't that much variation in fonts for Asian characters, is there?
There are also plenty of variations in calligraphy that make some store signs nearly impossible to read (for me). There are differences but many of them come from computers not being able to properly render pictographs.
It's a matter of the font size. At the size they're usually displayed, the characters look quite different [1] [2]. The real problem is that there's no way to mark a character as Chinese or Japanese on HN. So if it actually was the same character it would've been displayed identically. A common character that looks very different in Japanese and Chinese is 直 [3] and there's only one Unicode codepoint for it.
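You can verify that it really is a single code point; only font selection (or language metadata like an HTML lang attribute) decides which regional form gets drawn:

    print(hex(ord("直")))   # 0x76f4, one code point for both the Japanese and Chinese forms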
Now that we gave up on UCS-2, we could re-encode those overloaded Japanese/Chinese characters as separate Japanese and Chinese characters in the astral planes, like the Supplementary Multilingual Plane.
It offers a convenient utility to diff arbitrary strings, which is also quite handy for e.g. detecting normalization discrepancies, and installs a service so you can highlight a character in any app and use “Display character information” to see what it actually is.
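For a quick stand-in on other platforms, a few lines of Python do the same job: name every character, and compare normalized forms to spot normalization discrepancies; a sketch:

    import unicodedata

    def inspect(s):
        for c in s:
            print(f"U+{ord(c):04X} {unicodedata.name(c, '<unnamed>')}")

    inspect("\u037e;")                 # GREEK QUESTION MARK, then SEMICOLON
    a, b = "caf\u00e9", "cafe\u0301"   # precomposed é vs e + combining accent
    print(a == b)                      # False: same rendering, different code points
    print(unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b))  # True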
I work for a "major search engine" that does a lot of advertising & marketing stuff. To get the most out of it, we need customers to implement some javascript on their ecommerce sites.
As is often the case, javascript code that needs to get implemented on an ecommerce site often gets copy-pasted or emailed around a lot internally within a customer before it reaches the right person who can add it to the site's pages.
In this example, somewhere along the way a normal JavaScript snippet got all of its semicolons changed from the ASCII semicolon (U+003B) to its homoglyph, the Greek question mark (U+037E).
It was very confusing why Chrome was moaning about a semicolon being an illegal token. I had a genuine "Am I going mad? Seriously?" moment before I realised what was happening.
I've been bitten by things like this so many times, that my first reaction on seeing an inexplicable syntax error is to delete and manually re-type the line.
I was recently bitten by OS X's non-breaking space shortcut. I'd accidentally typed alt+space instead of space due to typing a following # (alt+3), and that made weechat, an app unaware of the NBSP, see the command '/join #channel' instead of '/join'.
It was a while ago now, but I remember copying the code out to a controlled isolated environment on my local machine, picking the first line and making sure it reproduced.
After that it was a matter of just going through the usual steps of working out why something isn't working. I think in this instance I actually ended up manually retyping the code, and when the "identical" retyped code worked, the penny dropped and I realised something was not what it seemed. If you paste a suspect character into something like http://unicode-table.com/en/ it will tell you right away what it really is.
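If you'd rather stay offline, Python's unicodedata module names the impostor in one line, e.g.:

    import unicodedata
    print(unicodedata.name("\u037e"))   # GREEK QUESTION MARK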
Weird bugs can respond well to brutally dumb debugging. Open the file and search for ';', perhaps? Or grep, etc. When you can see semicolons but your find tool can't, that's a worry.
But yeah getting to that point relies on pure inspiration, in my case.
I find search, and search-and-replace, the most useful IDE features; you can reformat entire documents with a series of well-thought-out search-and-replaces. :)
I'd like to know, too. Until then, you can type "ga" in normal mode to get a display of the decimal, hex, and octal value of the character under the cursor.
Something similar happened to me, but with fancy quotes. I spotted it by doing a "binary-search weird bug hunt": cut half of the code off and see if it's still complaining; if it is, cut half of what's left, and so on.
I had a similar problem. Copied a code snippet; Ruby started complaining about an undefined function. After nearly going mad, I looked at the source through a hex editor and could see the Unicode whitespace. I have yet to forgive Ruby, or Unicode whitespace, or the chat utility from whence I copied.
Hah, I was giving a presentation where I was running little queries as part of a demo. I had copied the queries into the presenter notes in PowerPoint, then pasted them one at a time into a web app to run them.
Couldn't figure out why one of them wasn't working, and it was actually an audience member who figured out PowerPoint had turned a quote character into a "smart quote".
I once had to do a team project with another student who did all his coding in Wordpad, god knows why. His indentation was more or less random. I wanted to murder him.
I've had to use non-breaking spaces a couple of times in order to get proper indentation. I know there's <PRE> and <CODE>, but they aren't allowed on every blog, so I had to improvise.
Then for $DEITY's sake, put the code snippets on pastebin or similar. Nothing's more infuriating than having to hex edit a file just to get all the \u00A0 out.
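No hex editor needed, though; a short cleanup script handles the usual invisible offenders (the file name is illustrative):

    # Replace common smart-paste artifacts with their ASCII equivalents.
    BAD = {"\u00a0": " ",                  # no-break space
           "\u200b": "",                   # zero-width space
           "\u201c": '"', "\u201d": '"',   # curly double quotes
           "\u2018": "'", "\u2019": "'"}   # curly single quotes

    with open("snippet.js", encoding="utf-8") as f:
        text = f.read()
    for bad, good in BAD.items():
        text = text.replace(bad, good)
    with open("snippet.js", "w", encoding="utf-8") as f:
        f.write(text)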
> blacklist pull requests from anyone who forks it
Why, if I may ask? If they introduce compile errors in your project, those should be caught by the CI build and test run, shouldn't they? In any case, accepting a code change just by looking at the diff and without even trying it out sounds like not the best course of action to me anyway.
Yes. A good CI will blow up on any malicious pull request, provided you are using a compiled language. If it's interpreted, you need sufficient code coverage that your tests will blow up instead.
I actually almost submitted something in that vein once. I'd type
> ls | wc -l
and get
> bash: wc: command not found
As it turns out, I need Alt+1 to type a pipe character on my keyboard. If I'm not quick enough releasing the Alt key, I type Alt+Space instead of just Space, which inserts a non-breaking space[1] on a Mac. This character is not a space, and therefore it gave me a weird "command not found" error.
This lasted for months until I found out what the problem was - given that it was a combination of my keyboard settings and OS, finding the root of the error took quite some time. The hint? The "command not found" error had an extra space in front of the unknown command.
Author of Python port of Unidecode here. I wrote a comment previously, pointing out that Unidecode does the reverse of Mimic. But then I actually checked the tables of characters that Mimic uses and deleted my comment.
Mimic chooses replacement characters solely based on their visual similarity with ASCII. Unidecode, while still doing character-by-character replacements without deeper analysis, tries to optimize the replacement tables for transliteration of natural languages.
For example, mimic will replace Latin capital H with Greek capital eta (U+0397), because they look similar. However, Unidecode will replace U+0397 with Latin capital E, because Latin E is typically used in place of Greek eta when transliterating Greek text to Latin.
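The example above is easy to check (pip install Unidecode; the package is on PyPI under that name):

    from unidecode import unidecode
    print(unidecode("\u0397"))   # 'E': eta transliterated the Greek way, not mimicked as 'H'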
On a Mac you (used to?) get a non-ASCII space when you hit the space bar while holding Alt, or something like that. Easy to fat-finger in any case, and it looks the same in most text editors. It's a great source of fun for novice Mac-using programmers to find out why the compiler complains.
It's not terribly difficult to define custom keyboard layouts for OS X. Make a copy of your preferred layout and get Ukelele [sic] from SIL¹ to remove NBSP from Option-Space. (Or just hand-edit the XML, changing the non-breaking space output to a plain space.)
The Commodore 64 (or some other machine from my childhood) would generate a non-ASCII space if you held down control (or shift maybe) when pressing it. To this day I'm careful about that. I didn't know it was still a possible problem.
Ironically, I have a weird OCD where I always assume I made a typo, so I keep deleting and retyping code a few dozen characters at a time, often in lines where I see nothing wrong. Over time this has just become something my hands do whenever my brain needs time to think about something else. So in a way I developed a natural immunity to said Unicode tricks ;)
I think you're not alone. A common error I've noticed is when you make a typo somewhere (that compiles) and then copy and paste it to a different place where you have the correctly named symbol. It's often hard to see the typo because the eye flies over the word. So you erase it and type it manually.
There's a set of rules used on domain names to stop homoglyph abuse there.[1][2] Applying those rules to language identifiers would prevent this problem. It's also useful to apply those rules to login names for forum/social systems. The rules prevent mixed language identifiers, mixed left to right and right to left text, and similar annoyances.
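A rough sketch of the mixed-script rule, using the first word of each Unicode character name as a stand-in for the real Script property (a heuristic, not how the actual registries implement it):

    import unicodedata

    def scripts(s):
        return {unicodedata.name(c).split()[0] for c in s if c.isalpha()}

    print(scripts("paypal"))        # {'LATIN'}
    print(scripts("p\u0430ypal"))   # {'CYRILLIC', 'LATIN'}: reject as mixed-script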
Ruby now allows (some) Unicode glyphs as names (allowing for things like Δv).
08:11:32 >> Δv = 3
=> 3
08:11:39 >> p Δv
3
=> 3
My solution when I have problems like this is to start building a negative regexp in vim:
/[^-a-zA-Z0-9 \[\]]
I then add other symbols as I find them. I can usually find the illegal characters in about 30 seconds this way—and I can add the non-ASCII glyphs that I expect to be present to my regexp.
> Many programming languages support non-ASCII variable name characters now.
Just because you can do something doesn't mean you should.
It is usually worth keeping variable names and such in English to enable international collaboration. Also, non-ASCII source files can get mangled in transit.
Well, it does happen, though – look at this weather data from a large German newspaper, it is in a custom format ('|' separated values) and in German: http://wetter.bild.de/data/meinwetter.txt
It happens all the time, everywhere, that people write code and stuff in their native language.
I think the line about "Mimic substitutes common ASCII characters for obscure homographs" has it backward. Shouldn't it say Mimic substitutes obscure homographs for common ASCII characters?
Never occurred to me before, but here "substitutes" reads to me as being commutative. I read both as having the same meaning. (i.e. you end up with unicode homographs replacing your ascii) Just me?
I don't know about C# compilers, but gcc gives me two errors, "stray ‘\315’ in program" and "stray ‘\276’ in program", which I suppose are the two utf-8 bytes. Rust says, "unknown start of token: \u{37e}". Either way, you get a pretty strong clue that there's a funny character present.
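Those octal escapes check out; U+037E is two bytes in UTF-8:

    b = "\u037e".encode("utf-8")
    print([oct(x) for x in b])   # ['0o315', '0o276']: gcc's "stray" bytes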
If faced with a linter error, I don't typically delete the marked stuff, write it anew, and hope, fingers crossed, that the error will be gone. I would try to make sense of the message, how it applies, and what the error is. At some point, though, I definitely would pull my hair out over a Greek question mark.
Only one character is going to be marked in this case, not a whole line or section of code. Deleting it and retyping it costs one second. I guess I've seen more than my fair share of encoding issues. I used to tutor at a university, so students were constantly coming in with code they'd copy/pasted out of their assignment (usually a Word doc) or from a web site.
I think that's a great argument. If someone mails the code, I hope to have the cleverness to suspect the encoding. However, I thought about a code repository or similar where this may be an issue, but most often is not. And I have seen some code where a wrong language character did not provoke a reasonable error, but some arbitrary parser error that went off in another line altogether (not necessarily C#).
The repo's README mentions a vim plugin to highlight Unicode homoglyphs. As an Emacs user, I did a quick M-x package-list-packages, thinking I'll find at least half a dozen equivalent Emacs packages.
To my dismay, there were none. So I spent the rest of my afternoon correcting this glaring deficiency. Fellow Emacs users, protect yourself from Unicode trolls and grab it here: https://github.com/camsaul/emacs-unicode-troll-stopper
I don't have an audio snippet, but I can transcribe what the voice says on different runs. It usually pronounces random letters individually, but sometimes pronounces syllables with letters missing.
Is anyone aware of the reverse of this, a homoglyph normalization library? I'd love to be able to take strings that visually look the same and compare them against one master list, such as for spam detection.
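Unicode TR39 calls this a "skeleton": map every character to a canonical lookalike, then compare skeletons. A minimal sketch with a hand-made table (a real library would ship the full confusables.txt data):

    # Illustrative subset of Cyrillic-to-Latin confusables.
    CONFUSABLES = {"\u0430": "a", "\u0435": "e", "\u043e": "o",
                   "\u0440": "p", "\u0441": "c"}

    def skeleton(s):
        return "".join(CONFUSABLES.get(c, c) for c in s.casefold())

    print(skeleton("p\u0430yp\u0430l") == skeleton("paypal"))   # True: same skeleton, flag it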
In some languages which allow non-ASCII but aren't Unicode-aware (PHP, for instance), you can add significant, invisible zero-width spaces to identifiers.
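For example, the two identifiers below print identically but are distinct; plain Python shown just to demonstrate the comparison:

    a = "total"
    b = "to\u200btal"                 # zero-width space hidden inside
    print(a, b)                       # visually identical
    print(a == b, len(a), len(b))     # False 5 6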