Mimic – abusing Unicode to create tragedy (github.com/reinderien)
408 points by epsylon on Oct 23, 2015 | hide | past | favorite | 171 comments



Slightly tangential:

In Russia there is a government procurement portal, where government organizations have to post their purchase requests to ensure competition and the best prices.

The usual tactic [1] of corrupt officials was to replace Cyrillic (Russian) letters with their Latin homoglyphs so that only affiliated companies could find and win the contract.

[1] http://www.bbc.com/russian/rolling_news/2013/04/130409_rn_st...


Now that you have revealed the secret, Hacker News will be banned in Russia forever.


Taylor Swift? Never heard of her: https://www.google.com/search?q=Τаylοr+Ѕwіft

:D


Kind of surprised at how poorly Google handles this (I would have expected at least a correction suggestion)! Heck, it might open the door for an obscure blackhat/phishing technique...


Not Google, but apparently one way some malware tries to hide edits to, say, the hosts file is to create a duplicate hosts file with the Cyrillic homoglyph for 'o' and then hide the real hosts file.

Presumably this would trick users who would go check "C:\windows\system32\drivers\etc\" but not show hidden files. Seems like a niche subset, but still a neat trick.


Google operates in a bunch of languages; they can't necessarily just assume you're using those code points by accident.


Or for shady SEO contracts. Not hard to guarantee #1 results if the contract spells out the exact (unexpected Unicode) phrase you're guaranteeing.


Forget google. Can you do this in domain names?



No, you're mistaken. It is actually a very big problem. Earlier on the same page you linked to, it explains that "ICANN approved the Internationalized domain name system, which maps Unicode strings used in application user interfaces"[1].

As a concrete example, the following are fake links to Wikipedia (and entirely equivalent):

http://xn--wkd-8cdx9d7hbd.org (FAKE, same as below)

http://www.wіkіреdіа.org (FAKE, same as above)

It is true that network protocols encode these internationalized domain names in a subset of ASCII, but the user sees Unicode in his browser address bar or email client. There is no restriction on how applications (like browsers) display domain names[2]; they can use Unicode if they want. This leads to all sorts of devious attacks[3].

[1] https://en.wikipedia.org/wiki/Domain_name#Internationalized_...

[2] https://en.wikipedia.org/wiki/Internationalized_domain_name#...

[3] https://en.wikipedia.org/wiki/IDN_homograph_attack
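For the curious, the relation between the two fake links above can be reproduced with Python's standard-library IDNA codec; a minimal sketch (the label below copies the homoglyph "wikipedia" from the fake URL):

```python
# Round-trip a homoglyph domain label through the IDNA codec.
# What goes on the wire is the ASCII-compatible "xn--" form;
# what the user may see in the address bar is the Unicode form.
label = "wіkіреdіа"  # mixes Latin w/k/d with Cyrillic і, р, е, а lookalikes
ascii_form = label.encode("idna")
print(ascii_form)  # an ASCII-compatible encoding starting with b"xn--wkd-"
print(ascii_form.decode("idna"))  # back to the Unicode homoglyph label
```

The basic (ASCII) characters w, k, d survive verbatim in the punycode output; everything else is packed into the suffix after the final hyphen.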


Maybe some sort of extortion scheme? Send an email to a small business person that isn't very technically savvy, say you have just erased all the search results for their business from Google, provide link, demand a Bitcoin to return the results.

Maybe a low hit rate, but if you could automate it, you could run the scam on a lot of places.


Similar in theme to that trick of sending strangers that "link to your facebook page" (http://facebook.com/profile.php?=73322363)


Can someone explain how this works?


The variable name before "=" is missing, so the given profile id gets ignored. By default, your own profile id is assumed.


Any invalid ID redirects to your own page, or something like that.



So apparently only the T has been replaced, but why is it showing different values for the "a"?


The 'T', 'a', 'o', 'S', and 'i' were all replaced, not just the 'T'.


Many have been replaced with Cyrillic counterparts.


So, one might wonder why these homographs have different code points. After all, the French A and the English A share the same code point.

It's really difficult to do the right thing here. If the Greek question mark shared a code point with the semicolon, it would obstruct search and replace for question marks.

Subtle differences in how Japanese and Chinese are written have led to differently written characters sharing the same code point. It's nice that you can easily look up most Japanese characters in a Chinese dictionary and see how they are used in China, but it has become frustratingly hard to get subtleties in their written form right. The Chinese version may have one line strike through another line, while the Japanese one only has it touching.

I honestly don't know how to go about posting examples of how the same code points have different written forms!

But it seems like it would be nice if code editors warned about text outside ASCII. You usually only want that in strings and comments.
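The Greek question mark mentioned above is a concrete illustration of how Unicode itself resolves one such case: U+037E is defined as canonically equivalent to the semicolon, so normalization folds it away. A quick check in Python:

```python
import unicodedata

greek_qm = "\u037e"  # GREEK QUESTION MARK, a lookalike of ';'
print(unicodedata.name(greek_qm))
# U+037E has a singleton canonical decomposition to U+003B,
# so even plain NFC normalization turns it into a real semicolon:
print(unicodedata.normalize("NFC", greek_qm) == ";")
```

So a search-and-replace tool that normalizes its input first would not actually be obstructed in this particular case.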


> It's really difficult to do the right thing here.

There is a fairly good solution to this for things like code editors and URLs or search strings in browsers: If a string contains a non-ASCII code point that is a homograph for an ASCII character, swap the text and background colors.


I meant the right approach to designating code points.


In today's lesson, wodenokoto learned that real life is infinitely more complicated than San Francisco startups would have you believe.


The reason for a lot of homographs with ASCII is that there are old code pages that have entire non-US alphabets and punctuation sets in the 128-255 range, and regular ASCII in the 0-127 range.

It's a design goal of Unicode to support exact round-trip transformation with any code page ever in use in the real world, so they can't unify two characters that appear at different code points in a Greek code page without breaking things, even if they're always graphically indistinguishable in a font.


> It's really difficult to do the right thing here. If Greek question marks share code point with semi-colon, it obstructs search and replace for question marks.

Context is the key here. Greek text doesn't use the semicolon for other purposes and searching/replacing such single characters in source code is a terrible idea anyway (think comments, string literals...). So what is the prohibitive failure scenario here?

Indistinguishable (for humans) characters with different code points were a stupid idea, it's fine to abuse it in order to point out that fact.


They're not necessarily indistinguishable. Line-break rules, directionality and typographical conventions (height/width/alternate glyphs) may differ between apparent homographs. And using the same code point would make distinguishing between, say, Latin and Cyrillic searches difficult. How would Google tell if you mean CCCP or СССР?
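The CCCP/СССР distinction is easy to surface programmatically. A crude sketch in Python, using the character name prefix as a stand-in for the Unicode Script property (real mixed-script detection, per UTS #39, uses the actual Script data):

```python
import unicodedata

def scripts(s):
    # Crude script detection via the first word of the Unicode character
    # name; proper implementations use the Script property instead.
    return {unicodedata.name(ch).split()[0] for ch in s if ch.isalpha()}

print(scripts("CCCP"))  # {'LATIN'}
print(scripts("СССР"))  # {'CYRILLIC'} -- Cyrillic ES and ER
```

A string whose letters span more than one script is a strong homograph-attack signal.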


I don't know enough about Greek to comment on it deeply. Some people do consider using the same code point for the apostrophe and quotation marks problematic (it's definitely annoying when doing word segmentation). We also have breaking and non-breaking spaces, as well as tabs. Luckily we didn't follow typewriter conventions and collapse several alphabetic characters!

A semicolon is not considered sentence-final, whereas a question mark usually is. This makes it easier for software to auto-capitalize. So at least that's a possible use case.

There's also the possibility that Greek requires the top dot to be square or circular, meaning it might in fact have subtle differences in print.


> some people do consider using the same code point for apostrophe and citations problematic (it's definitely annoying when doing word segmentation)

And those people are a menace. "Ball bearings" is a single word with a space in the middle. The only way you're going to get reliable word segmentation is with a natural language parser and a lexicon with an entry for "ball bearing". At that point, you're already recognizing "aren't" as a word; the punctuation isn't really relevant.

On the other hand, if you're not particularly upset about messing up space-including words, there's no real reason to be upset about apostrophe-including ones either.


> A semicolon is not considered sentence-final, whereas a question mark usually is. This makes it easier for software to auto-capitalize. So at least that's a possible use case.

Such software will need and have a language setting anyway (e.g. for hyphenation). It doesn't have to and cannot rely on code points alone, so the characters (or rather, different uses of the semicolon) needn't have different codepoints.


What if someone is quoting something in a different language for a translation or similar?


To be fair, that would break hyphenation as well if both languages use a Latin alphabet (give or take a few carons and such).


There is no "French A" and no "English A", I believe both languages call their alphabet the Roman (or Latin) alphabet. So those really are the same letter.


The Chinese could call their writing "the Roman alphabet" too (they don't), and label 闩 "A". So what?


The Chinese could do that, but they'd be wrong. ‘French A’ and ‘English A’ have undiverged continuity with ‘Roman A’. Chinese writing is neither descended from Roman nor an alphabet.


> ‘French A’ and ‘English A’ have undiverged continuity with ‘Roman A’.

This is true, but it's equally true that they both have undiverged continuity with 'Greek Α', which has its own code point.


> The Chinese version may have the line strike through another line, while the Japanese only has it touching.

Case in point: 每 (chinese) vs 毎 (japanese).


The problem is compounded by the fact that in Chinese, there are "traditional" and "simplified" versions of many characters (with the Japanese character usually, but not always, being the same as the traditional Chinese one). And let's not even get into the different writing styles. The distinction between different characters and different ways of writing the same character is not always clear.


I had to zoom in to my screen just to see the difference between those.

I know that I am perhaps not well informed on this issue, but I also think that having multiple code points for what most people would perceive as the "same" character is not a good idea.


They are not perceived as the same by most people in those cultures. There are tons of characters that, to someone whose eyes are not used to spotting the difference, would not look different at all. A single dot, a touch vs. a pass-through, or a curved tip vs. a straight one makes all the difference.

入 vs 人

玉 vs 王

口 vs ロ <- my font might make those indistinguishable out of context

千 vs 干 vs 于

of course in English we have 0 vs O and l vs 1


In English, even within the same letter there are variations on how it can look. For instance, the letter a can look different based on serif/sans-serif, double-story/single-story, or even cursive and italic. There isn't that much variation in fonts for Asian characters, is there?


There are also plenty of variations in calligraphy that make some store signs nearly impossible to read (for me). There are differences but many of them come from computers not being able to properly render pictographs.

Compare 人 with: http://i.imgur.com/ZpFMQoU.png for one of the most obvious differences.

For Japanese there is: 篆書、隷書、楷書、行書、草書.

https://en.wikipedia.org/wiki/Seal_script

https://en.wikipedia.org/wiki/Clerical_script

https://en.wikipedia.org/wiki/Regular_script

https://en.wikipedia.org/wiki/Semi-cursive_script

https://en.wikipedia.org/wiki/Cursive_script_(East_Asia)

Can see more here:

https://en.wikipedia.org/wiki/Japanese_calligraphy


It's a matter of the font size. At the size they're usually displayed, the characters look quite different [1] [2]. The real problem is that there's no way to mark a character as Chinese or Japanese on HN. So if it actually was the same character it would've been displayed identically. A common character that looks very different in Japanese and Chinese is 直 [3] and there's only one Unicode codepoint for it.

[1] https://en.wiktionary.org/wiki/%E6%AF%8F [2] https://en.wiktionary.org/wiki/%E6%AF%8E

[3] https://en.wiktionary.org/wiki/%E7%9B%B4


The case of the Turkish "i" is a good example of the difficulties faced by Unicode in this area. See multiple previous discussions on HN.


I was always curious about it: does reading Chinese require better eyesight because the glyphs are more complex?


No, it requires larger font sizes.


Nope, it just requires you to know what the glyph means :o


It's not the same. It's written differently.


So are single-story and double-story ‘a’ and ‘g’. But those don't have nationalist politics attached.


Think of them as words, not letters. It would supremely bother me if I were unable to write "colour" and always had to write "color".

I'm sure a Japanese person would be perplexed as to why that extra "u" matters since it's the "same damn word".


Come to England and go to the first floor of any building.

Want to jump off ?


> But it seems like it would be nice if code editors warned about text outside ascii. You usually only want that in strings and comments.

That's somewhat language-dependent, though it's true for most uses of most popular languages.


Now that we gave up on UCS-2, we could re-encode those overloaded Japanese/Chinese characters as separate Japanese and Chinese characters on astral planes like the Supplementary Multilingual Plane.


They're doing that. That's what a lot of plane 2 is.


Mac users might appreciate the great UnicodeChecker:

http://earthlingsoft.net/UnicodeChecker/

It offers a convenient utility to diff arbitrary strings, which is also quite handy for e.g. detecting normalization discrepancies, and installs a service so you can highlight a character in any app and use “Display character information” to see what it actually is.

I have a Python command-line version in my PATH which displays the character info for arbitrary input strings: https://github.com/acdha/unix_tools/blob/master/bin/unicode-...


Using the Taylor Swift example from https://news.ycombinator.com/item?id=10438363 in the comparison window looks like this:

https://www.dropbox.com/s/9j9h5rjt4gu22hb/Screenshot%202015-...

Each hex value shown can be clicked to open the Unicode character info for that codepoint


This sort of stuff can be the basis for many XSS attacks, see http://websec.github.io/unicode-security-guide/character-tra...

For instance, \u2329, \uFE64, \uFF1C and \u3008 can be best-fitted automatically to \u003C (the regular '<' mark in HTML)
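Two of those code points are visible even without any "best fit" machinery, because NFKC compatibility normalization alone collapses them into a real '<'. A small Python sketch:

```python
import unicodedata

# Compatibility variants of '<' that careless normalization can
# silently turn into a live markup delimiter:
for ch in ("\ufe64", "\uff1c"):  # SMALL / FULLWIDTH LESS-THAN SIGN
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {folded!r}")
```

This is why security guides recommend normalizing before validating, never after: validate the raw string, then you know exactly which characters survived.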


It is also a good tool to check whether Unicode is supported properly: just convert all user-visible messages and then check the program's interface for <?> or [].


I had something similar happen in the wild to me.

I work for a "major search engine" that does a lot of advertising & marketing stuff. To get the most out of it, we need customers to implement some javascript on their ecommerce sites.

As is often the case, javascript code that needs to get implemented on an ecommerce site often gets copy-pasted or emailed around a lot internally within a customer before it reaches the right person who can add it to the site's pages.

In this example somewhere along the way, a normal javascript snippet got all of the semi-colons changed from ; to ;.

In case you've not already spotted it, ; is not a ; but is actually "Greek Question Mark" (http://www.fileformat.info/info/unicode/char/037e/index.htm).

It was very confusing why Chrome was moaning about a semicolon being an illegal token. I had a genuine "Am I going mad? Seriously?" moment before I realised what was happening.
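Once you suspect this class of bug, a ten-line scanner finds the culprit instantly. A minimal sketch (the snippet below deliberately plants a U+037E where the semicolon should be):

```python
import unicodedata

snippet = "var x = 1\u037e"  # renders as 'var x = 1;' in most fonts
for offset, ch in enumerate(snippet):
    if ord(ch) > 0x7F:
        print(f"offset {offset}: U+{ord(ch):04X} {unicodedata.name(ch)}")
# -> offset 9: U+037E GREEK QUESTION MARK
```

Pointing something like this at the whole file beats re-typing code line by line.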


I've been bitten by things like this so many times, that my first reaction on seeing an inexplicable syntax error is to delete and manually re-type the line.


I was recently bitten by OS X's non-breaking space shortcut. I'd accidentally typed alt+space instead of space due to typing a following # (alt+3), and that made weechat, an app unaware of the NBSP, see the command '/join #channel' instead of '/join'.


I would probably have to quit my job before I could figure out that problem. May I ask how you spotted it?


It was a while ago now, but I remember copying the code out to a controlled isolated environment on my local machine, picking the first line and making sure it reproduced.

After that it was a matter of just going through the usual steps of working out why something isn't working. I think in this instance I actually ended up manually retyping the code, and when getting "identical" code that worked that was when the penny dropped and I realised something was not what it seemed. If you paste suspect characters into something like http://unicode-table.com/en/ it will tell you right away what it really is.


Weird bugs can respond well to brutally dumb debugging. Open file and search for ; perhaps? Or grep etc. When you can see semi-colons but your find tool can't, that's a worry.

But yeah getting to that point relies on pure inspiration, in my case.


I find searching, and search-and-replace, the most useful IDE features; you can reformat entire documents with a series of well-thought-out search-and-replaces :)


Personally, my vim status line has an indicator which shows the hex code for the rune currently under the cursor.

After fighting against word processor quotes, it's become second nature to double check it periodically.


I just created a vim plugin "vim-troll-stopper" https://github.com/vim-utils/vim-troll-stopper

It protects you from these tricks by highlighting "troll" unicode characters in red.


How about,

  :syn match Error "[^ -~]"


That seems useful. Can you share it?


Here you go:

    :set statusline=%F%m%r%h%w\ [TYPE=%Y]\ [ASCII=\%03.3b]\ [HEX=\%02.2B]\ [POS=%04l,%04v][%p%%]\ [LEN=%L]
    :set laststatus=2
Produces a status line when inserting and recording a macro, like (the character under the cursor is 'm'):

    ~/.vimrc [TYPE=VIM] [ASCII=109] [HEX=6D] [POS=0123,0020][67%] [LEN=182]
    -- INSERT --recording
And, of course... `:help statusline`


This doesn't seem Unicode-aware at all. I put my cursor on the character 每 and it says:

  [ASCII=2>4] [HEX=0>4]
It's not ASCII, and its hex code is 6BCF.

It also says "ASCII=252" when I put the cursor over "ü". Claiming that values over 127 are ASCII is just a malapropism.


vi can be so elegant.

For us emacsen you can do a (what-cursor-position &optional DETAIL) which is usually bound to Ctl-x =

I don't think it will be nearly this clean to add it to the modeline, but I'll take a look.


Very nice! Thanks.


Cool! How do you do that?


I'd like to know, too. Until then, you can type "ga" in normal mode to get a display of the decimal, hex, and octal value of the character under the cursor.


vim-characterize has a nice enhancement of "ga", adding Unicode info, digraphs, emojis, and HTML entities. https://github.com/tpope/vim-characterize


Replied with it in a sibling comment.


Something similar happened to me, but with those fancy quotes. I spotted it by doing a "binary-search weird bug hunt": cut half of the code off and see if it's still complaining; if it is, cut the other half, and so on.


Ah, that old stand-by. Always warm and ready for the odd unicode hell-bug.


I had a similar problem. Copied a code snippet. Ruby started complaining about an undefined function. After nearly going mad, and then looking at the source through a hex editor, you could see Unicode whitespace. I have yet to forgive ruby or Unicode whitespace. Or the chat utility from whence I copied.


There is a special place in hell for anyone doing this. I'm going to watch this repo and blacklist pull requests from anyone who forks it :-)


They share the place with coding blogs that use &nbsp; instead of spaces for code snippets.


Or “real” quotes in code examples..


Or en-dashes instead of double hypens for command line flags... I can feel my blood pressure rising just thinking about it.


Hah, I was giving a presentation where I was running little queries as part of a demo. I had copied the queries into the presenter notes in PowerPoint, then pasted them one at a time into a web app to run them.

Couldn't figure out why one of them wasn't working, and it was actually an audience member who figured out PowerPoint had turned a quote character into a "smart quote".


I don't think people do this intentionally. Either the code snippet has passed through MS Word (why?) or their blog tool is being "helpful".


I once had to do a team project with another student who did all his coding in Wordpad, god knows why. His indentation was more or less random. I wanted to murder him.


It happens if you paste the code into an Outlook email (which used Word as the rendering engine IIRC)


Code snippets have this problem when emailed using some email clients (Outlook, looking at you)


Damn you Google docs! they also do this :(


I've had to use &nbsp; a couple of times in order to get a proper indentation. I know there's <PRE> and <CODE>, but they aren't allowed on every blog, so I had to improvise.


Then for $DEITY's sake, put the code snippets on pastebin or similar. Nothing's more infuriating than having to hex edit a file just to get all the \u00A0 out.


Wouldn't a regular text replace work, switching one character for another?


> blacklist pull requests from anyone who forks it

Why, if I may ask? If they introduce compile errors in your project, those should be caught by the CI build and test run, shouldn't they? In any case, accepting a code change just by looking at the diff and without even trying it out sounds like not the best course of action to me anyway.


Yes. A good CI will blow up on any malicious pull request, provided you are using a compiled language. If interpreted, you need sufficient code coverage that your tests will blow up.


I can foresee a new phenomenon arising in stackoverflow-style sites and coding discussion forums:

"My simple piece of code looks perfect and should work without problems. Yet it won't compile! Help!"

Answer:

"Try running `./mimic --reverse` on your source."


I actually almost submitted something in that vein once. I'd type

  > ls | wc -l
and get

  > bash:  wc: command not found
As it turns out, I need Alt+1 to type a pipe character on my keyboard. If I'm not quick enough releasing the Alt key, I'll type Alt+Space instead of just Space, which inserts a non-breaking space[1] on a Mac. This character is not a space, and therefore it gave me a weird "command not found" error.

This lasted for months until I found out what the problem was - given that it was a combination of my keyboard settings and OS, finding the root of the error took quite some time. The hint? The "command not found" error had an extra space in front of the unknown command.

[1] https://en.wikipedia.org/wiki/Non-breaking_space


This bit me as well, as I mentioned in a comment above. The solution I found best was to make OS X not produce an NBSP on alt+space.


That should be a comment.


True, except in the rare situation where the question really features a code sample containing some evil homographs.

Edit: Oh hey, I actually found one that fits the bill:

http://stackoverflow.com/questions/14925894/trouble-with-arg...


I'm reminded how very useful I've found Text::Unidecode in the past.

http://search.cpan.org/~sburke/Text-Unidecode-1.27/lib/Text/...


Author of Python port of Unidecode here. I wrote a comment previously, pointing out that Unidecode does the reverse of Mimic. But then I actually checked the tables of characters that Mimic uses and deleted my comment.

Mimic chooses replacement characters solely based on their visual similarity with ASCII. Unidecode, while still doing character-by-character replacements without deeper analysis, tries to optimize the replacement tables for transliteration of natural languages.

For example, mimic will replace Latin capital H with Greek capital eta (U+0397), because they look similar. However, Unidecode will replace U+0397 with Latin capital E, because Latin E is typically used in place of Greek eta when transliterating Greek text to Latin.


I have used the php port long ago when creating a simple website search engine... Great project!


On a Mac you (used to?) get a non-ascii space when you hit the space bar while holding Alt or something like that. Easy to fat-finger it in any case and looks the same in most text editors. It's a great source of fun for novice Mac-using programmers to find out why the compiler complains.


This is still happening as of today:

ps aux | grep foo

zsh: command not found: grep

It happens to me at least every other day.


It's not terribly difficult to define custom keyboard layouts for OS X. Make a copy of your preferred layout and get Ukelele [sic] from SIL¹ to remove NBSP from Option-Space. (Or just hand-edit the XML changing "&#xA0;" to " ".)

¹ http://scripts.sil.org/ukelele


It still does! You've just explained something that has been annoying me for about a year now. Thanks!


The Commodore 64 (or some other machine from my childhood) would generate a non-ASCII space if you held down control (or shift maybe) when pressing it. To this day I'm careful about that. I didn't know it was still a possible problem.


A good IDE would pick this up.

I had the habit of pressing the ALT key a bit ahead of time before an OR operator.

if (foo || bar)


Ironically, I have a weird OCD where I always assume I made a typo, so I keep deleting and retyping code a few dozen characters at a time, often in lines where I see nothing wrong. Over time this has just become something my hands do whenever my brain needs time to think about something else. So in a way I developed natural immunity to said Unicode tricks ;)


I think you're not alone. A common error I've noticed is when you make a typo somewhere (that compiles) and copy and paste it in a different place where you have the correctly named symbol. It's often hard to see the typo because the eye flies over the word. So you erase it and type it manually.


Hey I do the same!

Only with variables, and almost always when using array indexes that are not 'i' or 'x'.

Sometimes it annoys me, and I've noticed that it's worse when I use CamelCasing and not as bad when i_do_this.


There's a set of rules used on domain names to stop homoglyph abuse there.[1][2] Applying those rules to language identifiers would prevent this problem. It's also useful to apply those rules to login names for forum/social systems. The rules prevent mixed language identifiers, mixed left to right and right to left text, and similar annoyances.

[1] https://tools.ietf.org/html/rfc5893 [2] http://unicode.org/reports/tr46/


I guess someone should develop an IDE/editor plugin that marks non-ASCII characters outside of string literals.


Ruby now allows (some) Unicode glyphs as names (allowing for things like Δv).

    08:11:32 >> Δv = 3
    => 3
    08:11:39 >> p Δv
    3
    => 3
My solution when I have problems like this is to start building a negative regexp in vim:

  /[^-a-zA-Z0-9 \[\]]
I then add other symbols as I find them. I can usually find the illegal characters in about 30 seconds this way—and I can add the non-ASCII glyphs that I expect to be present to my regexp.


JavaScript allows a wide range of Unicode characters - http://stackoverflow.com/a/9337047


> marks non-ASCII characters outside of string literals.

Many programming languages support non-ASCII variable name characters now.


> Many programming languages support non-ASCII variable name characters now.

Just because you can do something doesn't mean you should.

It is usually worth keeping variable names and such in English to enable international collaboration. Also, non-ASCII source files can get mangled in transit.


Well, it does happen, though – look at this weather data from a large German newspaper, it is in a custom format ('|' separated values) and in German: http://wetter.bild.de/data/meinwetter.txt

It happens all the time, everywhere, that people write code and stuff in their native language.


That's data. The suggestion is about variable names.


Well, the variable names of Bild.de (for example HTML class names) are also in German.

It happens all the time, everywhere.


Good point, didn't consider that. Although any good IDE/editor should catch the use of "undeclared" variables and functions.



I think the line about "Mimic substitutes common ASCII characters for obscure homographs" has it backward. Shouldn't it say Mimic substitutes obscure homographs for common ASCII characters?


Never occurred to me before, but here "substitutes" reads to me as being commutative. I read both as having the same meaning. (i.e. you end up with unicode homographs replacing your ascii) Just me?


Substitute works IMO the same way replace does[1].

Substitute poison for healthy food.

Substitute poison with healthy food.

The first means you take away healthy food and give poison. The later means you take away poison and give healthy food.

[1] Except that "replace X for Y" sounds weird, except in the common phrase "replace like for like" (and probably some others!).


In that case, you won't mind if I substitute poison for your favorite tasty beverage.


Technically, my favorite tasty beverage is poison.


Not to my ear. To my ear, substituting X for Y is the same thing as replacing Y with X.


Technically, it does both ;-)


s/for/with


Also GREAT if you're trying to identify untaken phishing domain names to register for your next scam!


A lot of unicode characters are blocked for domain names for this exact reason.


wouldn't you end up with the 'xn--' ascii expansion in the url window?


In Chrome, probably [1]. Other browsers don't seem to be as strict.

[1]https://www.chromium.org/developers/design-documents/idn-in-...


Hyperlink, i.e. http://.com. How many people will double-check the URL bar and notice the URL is actually http://www.xn--m3haa.com/? Not all of them.

Edit: seems like the three umbrella Unicode symbols are not supported on HN; are they supported in e-mails?


Most modern browsers will show you the Unicode version.


Nowadays most modern browsers will revert to the punycode ("xn--") if there is any chance of confusion, cf. https://en.wikipedia.org/wiki/IDN_homograph_attack


Mimic author here... sorry, humanity...


Wow, now that's just pure evil.


Yes, seriously. This is why we can't have nice things.


One could name variables and functions to later identify whether code was copied (e.g. to find out whether somebody copied some GPL code).


Note to self: Run mimic --reverse on GPL code I copy.


Spotify used to have a security problem with this kind of characters:

https://labs.spotify.com/2013/06/18/creative-usernames/


> Replace a semicolon (;) with a greek question mark (;) in your friend's C# code and watch them pull their hair out over the syntax error

I'm not sure how frustrating this would be. Wouldn't most people just delete the character immediately and type a new one?


I don't know about C# compilers, but gcc gives me two errors, "stray ‘\315’ in program" and "stray ‘\276’ in program", which I suppose are the two utf-8 bytes. Rust says, "unknown start of token: \u{37e}". Either way, you get a pretty strong clue that there's a funny character present.


If faced with a linter error, I don't typically delete the marked stuff, write it anew, and hope with fingers crossed that the error will be gone. I would try to make sense of the message, how it applies, and what the error is. At some point, though, I would definitely pull my hair out over a Greek question mark.


Only one character is going to be marked in this case, not a whole line or section of code. Deleting it and retyping it costs one second. I guess I've seen more than my fair share of encoding issues. I used to tutor at a university, so students were constantly coming in with code they'd copy/pasted out of their assignment (usually a Word doc) or from a web site.


I think that's a great argument. If someone mails the code, I hope to have the cleverness to suspect the encoding. However, I thought about a code repository or similar where this may be an issue, but most often is not. And I have seen some code where a wrong language character did not provoke a reasonable error, but some arbitrary parser error that went off in another line altogether (not necessarily C#).


This somewhat reminds me of this little entry on how "tolerant" JavaScript is...

https://mathiasbynens.be/notes/javascript-identifiers


This can actually be used productively, to see how your app reacts to weird input :)


The repo's README mentions a vim plugin to highlight Unicode homoglyphs. As an Emacs user, I did a quick M-x package-list-packages, thinking I'll find at least half a dozen equivalent Emacs packages.

To my dismay, there were none. So I spent the rest of my afternoon correcting this glaring deficiency. Fellow Emacs users, protect yourself from Unicode trolls and grab it here: https://github.com/camsaul/emacs-unicode-troll-stopper


This seems like a useful tool for fuzztesting your dev ops person, or if you are the dev ops person, for fuzz testing development. Fuzz for all!


Piping the result through TTS creates weird results (on OS X):

  echo "hello world" | mimic --me-harder 100 | say


Can anybody provide an audio snippet for those of us who use Linux?


I don’t have an audio snippet, but I can transcribe what the voice says on different runs. It usually pronounces random letters individually, but sometimes pronounces syllables with letters missing:

“L-W-R-D”, “L-L-W-R-L”, “hell-erl”, “H-L-er-D””, “eor-D”, “H-L-L-erl”, “L-L-W-L-R-D”, “H-L-W-R-L”, “hell-W-R”


Just chuck the code into an XML validator. Any character > 127 will be flagged as invalid.


Is anyone aware of the reverse of this, a homoglyph normalization library? I'd love to be able to take strings that visually look the same and compare them against one master list, such as for spam detection.


In cases like those I use unicodelookup.com to list suspicious characters :)


That site does not work for me. I paste into the input field and it automatically turns into %F0%9F%90%88.

Compare with https://codepoints.net/ instead.

Edit: great, HN is broken too.


Thanks, didn't know this one.


Add the following to your ~/.vimrc to always highlight non-ascii characters:

    au BufWinEnter * let w:matchnonascii=matchadd('ErrorMsg', "[\x7f-\xff]", -1)


These dang democrats done banned Ben Carson from google man!

https://www.google.com/search?q=Ben+Сarѕоn


Made a perl port: https://metacpan.org/pod/mimic (currently 50% faster)


In some languages which allow non-ASCII but aren't Unicode-aware (PHP, for instance), you can add significant, invisible zero-width spaces to identifiers.
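The sting here is that a zero-width character changes string identity without changing appearance. A quick Python demo (Python happens to reject U+200B in identifiers, so ordinary strings stand in for the PHP case):

```python
s1 = "admin"
s2 = "ad\u200bmin"  # ZERO WIDTH SPACE hiding inside; renders as "admin"
print(s1 == s2)             # False
print(len(s1), len(s2))     # 5 6
```

Two names that print identically, compare unequal, and differ in length by one invisible character.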


    > var ﷺ = 1;
    < undefined


Hmm... I wonder if this can be used in browser source maps.


Some people just want to see the world burn...


Some men just want to watch the world burn.


And now I know what I'm doing for April 1st next year.


i smell a Notepad++ extension


YOU ARE A TERRIBLE PERSON AND I LIKE YOU



