

Homoglyph attacks: How to create an internet hoax - sp332
http://www.azarask.in/blog/post/does-google-censor-tiananmen-square-how-to-create-an-internet-hoax/

======
jbert
So...what do we do about it?

How about a standard algorithm/mapping table for grouping unicode chars to
known languages and flagging when a unicode string pulls chars from more than
one language? (possibly only flagging if a change occurs within a 'word').
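
A rough sketch of that check, not a proposed standard: Python's standard library does not expose the Unicode Script property, so this guesses the script from the first word of each character's Unicode name. That heuristic is an assumption and breaks for names such as "FULLWIDTH LATIN ..."; a real implementation would use the Script property data instead. The helper names `script_of` and `mixed_script_words` are made up for illustration, and "word" here just means whitespace-separated.

```python
import unicodedata

def script_of(ch):
    """Rough script guess: first word of the Unicode character name (heuristic)."""
    if not ch.isalpha():
        return None  # ignore digits, punctuation and spaces
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None

def mixed_script_words(text):
    """Return the whitespace-separated words whose letters come from more than one script."""
    flagged = []
    for word in text.split():
        scripts = {script_of(ch) for ch in word} - {None}
        if len(scripts) > 1:
            flagged.append(word)
    return flagged

# U+0430 is CYRILLIC SMALL LETTER A, which looks like a Latin "a".
spoofed = "ti\u0430n\u0430nmen square"
print(mixed_script_words(spoofed))             # flags only the first word
print(mixed_script_words("tiananmen square"))  # nothing flagged
```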

A standard way of rendering unicode strings could then be to highlight
flagged chars (perhaps with a different colour). More sensitive areas (such as
the browser url bar) could fire additional alerts if any chars were flagged.

Feels like a nice minor RFC to me - anyone see a problem with it?

~~~
woodall
I like your idea of changing the color of the letters if they are in a
different language. I might actually work on that.

~~~
jbert
Cool. I think a well-defined algorithm for defining a flagged character is
important, to get cross-platform/application support and to validate the algo.
(Any bug in the algo is a possible security hole, more so if people come to
trust that any dodgy characters _are_ highlighted).

An exposed API that takes UTF-8 strings and returns the count and/or indices
of flagged chars would also be useful. Perhaps it could work as an extension
to ICU?
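
To make the shape of such an API concrete, here is a hypothetical function, `flagged_indices`, which is not part of ICU or any existing library. It takes UTF-8 bytes plus an expected script and returns the code-point indices of letters that fall outside it. The script test is the same name-prefix heuristic one might use for a prototype, so treat it as an assumption rather than a validated algorithm.

```python
import unicodedata

def flagged_indices(data: bytes, expected: str = "LATIN"):
    """Decode UTF-8 and return code-point indices of letters outside `expected`.

    `expected` is matched against the start of the Unicode character name,
    which is only an approximation of the real Script property.
    """
    text = data.decode("utf-8")
    return [
        i
        for i, ch in enumerate(text)
        if ch.isalpha() and not unicodedata.name(ch, "").startswith(expected)
    ]

raw = "p\u0430ypal.com".encode("utf-8")  # second letter is Cyrillic
hits = flagged_indices(raw)
print(len(hits), hits)  # count and indices of flagged chars
```

Note that the indices are code-point offsets into the decoded string, not byte offsets; an API meant for cross-language use would need to pick one and document it.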

<http://site.icu-project.org/>

If nothing else, the "change language within a word" should use a unicode-sane
definition of 'word', which ICU would give you.

Once you have the two pieces above, adding the colouring etc should be
reasonably straightforward. But it'd be a shame if the two pieces above
weren't factored out - since I don't think the widespread adoption would
follow otherwise.

------
tptacek
A really well-known old attack, far more painfully expressed in
internationalized domain names (IDN), where attackers can create pixel-perfect
DNS clones of "paypal.com".
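
For illustration only, the snippet below uses Python's built-in `idna` codec (which implements the older IDNA 2003 rules) to show that a lookalike name is a different string and gets a different on-the-wire form, even though it renders the same. The spoofed label is an invented example with one Cyrillic letter.

```python
spoof = "p\u0430ypal.com"  # second letter is U+0430 CYRILLIC SMALL LETTER A
real = "paypal.com"

print(spoof == real)         # False: the code points differ
print(real.encode("idna"))   # pure ASCII passes through unchanged
print(spoof.encode("idna"))  # ASCII-compatible form starting with "xn--"
```

The "xn--" form is what actually goes into DNS, which is why showing it in the URL bar is one possible visual cue.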

Notable also as an example of an attack that the DNSSEC security model does
little to combat.

One of the biggest security weaknesses on the Internet is in browser UI
(something Aza Raskin should be shouting from the rooftops). Right now, users
are "trained" (to dignify what's happening) to look at the browser URL bar for
a name and a lock icon.

We don't need new protocols or even changes in old ones so much as we need
Mozilla, Microsoft, Google, and Apple to sit down and come up with a standard
set of UI idioms that will allow users to quickly and visually "authenticate"
a domain name from the cues made available to the browser.

~~~
sp332
_One of the biggest security weaknesses on the Internet is in browser UI
(something Aza Raskin should be shouting from the rooftops)._

He has been shouting it from the rooftops. Tabnapping, which got on Reuters:
<http://news.ycombinator.com/item?id=1376075>

And an older trick, now fixed in Firefox:
<http://news.ycombinator.com/item?id=201912>

~~~
tptacek
You're right, I shouldn't have implied he wasn't paying attention at all.
(Although Tabnapping has little to do with the problem I'm talking about).

------
petercooper
You can use this technique to mention people/companies/political parties on
Twitter without them easily finding the tweet (e.g. the vile racists
of the BиP ;-)).

~~~
wlievens
In this case, the и intensely reminds me of a swastika so the effect is
doubled!

------
ff0066mote
This article presents some interesting consequences of using a unified map
from numbers to glyphs for all the plethora of human languages.

Perhaps instead of Unicode there could have been an established way to change
character encoding mid-string? Regarding and treating control characters as
normal characters, as this article demonstrates we do with the mirror
character at least, is nonsensical. Maybe a string could have a primary
encoding, and tags could be placed around sections which use an alternate.

Under such a scheme, in an English string any Cyrillic sections would be
surrounded by "tag" bits. The viewing program could then handle the Cyrillic
part according to its capability and known environment:

\- An English terminal might render as nonsense English characters.

\- A capable viewing program could render proper Cyrillic glyphs according to
the Cyrillic encoding.

\- A good browser could render the Cyrillic glyphs but highlight them, display
them in a different color, or even remove them.

Likewise, right-to-left languages would be implicitly rendered as such
according to the capability of the viewing program.

But this has no doubt already been suggested and rejected by somebody...

~~~
jerf
It doesn't really buy you anything. Everything you described is already
possible with UTF-8, even the first bit about misinterpreting UTF-8 as an
8-bit ASCII. You've still described a single mapping of characters to numbers,
you've just created a much more inconvenient encoding.
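
A small demonstration of the misinterpretation case, using Latin-1 as a stand-in for "8-bit ASCII" (that choice of codec is an assumption made for the example):

```python
word = "ti\u0430n\u0430nmen"  # two Cyrillic letters hidden in a Latin word
raw = word.encode("utf-8")

# Each Cyrillic letter takes two bytes in UTF-8, the Latin ones take one.
print(len(word), len(raw))

# A viewer that assumes a single-byte encoding shows each of those byte
# pairs as two unrelated Latin-1 characters.
misread = raw.decode("latin-1")
print(misread)
```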

~~~
ff0066mote
One could argue that a filesystem which contains two files, one in ASCII and
one in EBCDIC, is on the whole just a single mapping of characters to numbers.
Your point isn't well argued.

What I'm suggesting differs in that it would use consistent tags to indicate
deviations from the intended primary language of a string.

Surround sections of a string which _should be interpreted differently_ with
tags-bits indicating such. These tags _should not be treated as characters_ as
the mirror character described in the article obviously is. And finally,
display routines would easily be able to indicate what's what; any sections of
the string with tags indicating an alternate encoding could be displayed with
bold, highlight, or in a different color. (i.e. as in
"ti<cyrillic>a</cyrillic>n<cyrillic>a</cyrillic>nmen square", except the tags
would simply be unused bit-patterns.)

To accomplish the same thing with a Unicode string, one would have to store a
list of ranges of code-points which correspond to the string's locale. Then,
anything which isn't in one of those ranges could be displayed in the
alternate color. (ie. A character outside of the range 32-126 would be
considered "not EN-US". For other languages, the code-point ranges might be
more complex.)
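
As a sketch of that range test for the simplest case, the hypothetical helper below treats code points 32-126 as the whole "EN-US" range and reports everything else. Other locales would need a list of ranges rather than one pair of bounds.

```python
def outside_printable_ascii(text):
    """Return (index, char) pairs for code points outside the range 32..126."""
    return [(i, ch) for i, ch in enumerate(text) if not 32 <= ord(ch) <= 126]

# The two Cyrillic letters at indices 2 and 4 are reported.
print(outside_printable_ascii("ti\u0430n\u0430nmen square"))
```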

The scheme I suggested isn't great, but it would avoid having to compare each
character to see if it falls somewhere in a list of ranges for the current
locale.

------
rryyan
The unicode "mirror" character mentioned in the article was very interesting
-- I had not heard of its existence before. This Google search indicates its
potential for hijinks:

<http://www.google.com/search?q=%26%238238>;

~~~
ars
It's not the mirror character, it's the "Right-to-Left Override" character.
It's for bi-directional text to mark a word as being right to left.
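
The character is U+202E (8238 in decimal, the number in the search URL above). A minimal sketch of how a program might spot or strip it, using the bidirectional class that Python's `unicodedata` module reports; the set of classes treated as "controls" and the helper names are choices made for this example:

```python
import unicodedata

# Bidirectional classes of the explicit embedding, override and isolate controls.
BIDI_CONTROLS = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def bidi_controls(text):
    """Return (index, name) for every explicit bidi control character in text."""
    return [
        (i, unicodedata.name(ch))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROLS
    ]

def strip_bidi_controls(text):
    """Return text with the explicit bidi control characters removed."""
    return "".join(
        ch for ch in text if unicodedata.bidirectional(ch) not in BIDI_CONTROLS
    )

sample = "\u202eIt's pretty cool"
print(bidi_controls(sample))
print(strip_bidi_controls(sample))
```

Stripping is only reasonable where right-to-left text is not expected; legitimate bidirectional text can need these characters.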

~~~
nixy
‮ ‮It's pretty damn cool!

------
earle
Who exactly upvoted this? The fact that this is on the front page of Hacker
News is really disappointing.

~~~
cynicalkane
Why?

