IDN is crazy (haxx.se)
162 points by signa11 on Dec 15, 2022 | hide | past | favorite | 136 comments


Yeah, IDNs should have never happened. Homograph attacks alone are reason enough for IDNs to be a terrible idea, and they cannot be fully prevented even with special rules, because different fonts might render distinct characters similarly that are not normally considered "homographs".

I feel IDNs are part of a larger trend of "Unicode should be everywhere", without any consideration for whether this is actually a good idea. Programming languages are guilty of this too, with many allowing almost any Unicode codepoint in identifiers. This has created an entirely new class of security vulnerabilities that are serious enough that VSCode now flags such codepoints by default if they look similar to certain ASCII characters.

Basic ASCII Latin characters are the only characters that can be entered using almost any input device in existence without additional tools or configuration. This makes them a universal baseline, and software should carefully consider whether deviating from that baseline is really worth the security vulnerabilities and incompatibilities it may introduce, just so people can write their source code comments in Vedic Sanskrit, for a programming language where every keyword and almost every function name is in English.


This is a very anglo-centric view. As someone who belongs to a small nation, I'm very happy and grateful that I'm able to use my language on the Internet as much as possible.


I'm not against Unicode in general. Text content on the Internet most definitely should support the full range of Unicode characters.

But that doesn't mean domain names should, or identifiers in programming languages. Those are critical, fairly low-level technologies, and allowing all of Unicode raises significant security concerns. I strongly believe those concerns outweigh the usefulness of non-Latin characters in those contexts.

The accusation of anglocentrism is unwarranted here. Every mainstream programming language ever created has all of its keywords based on the English language, including languages like Ruby and Python whose creators aren't native English speakers. This isn't anglocentric, it's a best-effort attempt to make the language as universal as possible in the world we live in. Internationalization at this level of technology simply leads to fragmentation. Domain names and programming languages are very different from user-focused free-form web content.


I'm not debating Unicode in programming languages but in domains.

Also, I disagree that domain names aren't user focused. It's the first thing you learn as a regular web user (at least that used to be the case before, now it's usually the app name).


On the contrary, the accusation is warranted. Domain names are printed on billboards, in newspapers, shown in TV ads and a myriad other examples outside the programming domain. I have not heard anyone from a non-English speaking country argue your viewpoint.


> I have not heard anyone from a non-English speaking country argue your viewpoint.

Well, now you have. I'm not a native English speaker and I'm not from a country where English has any official status.


What characters are present in your native language that are not in ASCII?


> What characters are present in your native language that are not in ASCII?

Is that a joke? Or are you really under the impression that most languages use subsets of the Latin alphabet? Even English uses letters that aren't in ASCII, much less literally every other one.


I feel like you have misread the comment you are responding to. That question was almost certainly an attempt to verify that the person arguing the other side of the point wasn't willing to offhandedly screw over the majority of the world that doesn't use Latin characters, just because their own non-English language--which they had thrown out for street cred here--happened not to need anything that couldn't be readily mapped, with little corruption, to the Latin set.


The English alphabet does not contain non-ASCII chars. Which makes it quite unusual.


Not entirely true. There is some use of umlauts. For example, The New Yorker insists on spelling re-elect as reëlect.


I don’t think people consider them part of the English alphabet. Which letters do they sort between, according to English sorting rules?

Also, “Those two dots, often mistaken for an umlaut, are actually a diaeresis… Most of the English-speaking world finds the diaeresis inessential. Even Fowler, of Fowler’s “Modern English Usage,” says that the diaeresis “is in English an obsolescent symbol.””

https://www.newyorker.com/culture/culture-desk/the-curse-of-...


Tell me how many of the top 100 CJK websites use IDN...


I am from Czechia, where many diacritic marks are used. The CZ domain registry conducts periodic surveys on whether to allow IDN domains for .cz, and 'no' wins consistently with a ~2/3 majority. See https://www.xn--hkyrky-ptac70bc.cz/ (there is also info in English).

The main argument against is that everybody is accustomed to the current situation and knows that when a brand contains diacritic marks, they should be removed when entering the URL. If IDN were introduced, it would be confusing, as some would use the ASCII version and some the IDN version, while most would just buy two domains instead of one.
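(For anyone wondering what that xn-- form in the URL is: IDN labels travel through DNS in an ASCII-compatible encoding, "xn--" plus Punycode, per RFC 3490/3492. A minimal sketch using Python's built-in idna codec; the label below is just an illustrative Czech word with diacritics, not the registry's actual domain:)

```python
# IDN labels are stored in DNS as ASCII: "xn--" + Punycode
# (RFC 3490/3492). Python's built-in "idna" codec converts both ways.
label = "háčky"  # illustrative Czech label with diacritics
ascii_form = label.encode("idna")
print(ascii_form)                 # an ASCII form starting with b"xn--"
roundtrip = ascii_form.decode("idna")
print(roundtrip)                  # back to "háčky"
```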


Imagine living in a world where domain names can only be written in Arabic script. Would that "best effort" be good enough for you?


If Arabic had the reach and status of the Latin script, why not?

But it doesn't, so that's an irrelevant hypothetical.


Also known as "I was born in the right part of the world, who cares?"


Also known as "the way the world works".

Which is why almost every institution everywhere uses kilometers or miles to measure distances, not parasangs or barids.


> Also known as "the way the world works".

Wait, IDNs exist, they are "the way the world works". You are arguing that they should not exist, because you don't have a use for them and they inconvenience you. How can you justify your desire for the world to work differently by saying that it's "the way the world works"?


> "the way the world works"

Well, the way the world works created IDN and Unicode identifiers in languages. How about that?


TIL about parasangs and barids. Thanks!


It is starting to seem like you are the one perturbed by the idea of Arabic script, which is weird since you brought it up.


Arabic looks uniquely hard to read to me, but I wouldn't mind learning the Thai, Armenian or Georgian script if it were what almost everyone else was using. Even Hebrew and Hangul look easier to learn.


How does DNS handle RtL languages?

Are palindromes the key?


Unicode has the concept of directionality built in already but it took some work to align with the existing assumptions of DNS https://datatracker.ietf.org/doc/html/rfc5893
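As a taste of what RFC 5893 constrains, here is a drastically simplified fragment of its "Bidi Rule" sketched with Python's unicodedata (the real rule also constrains every other character in the label and how the label ends):

```python
import unicodedata

def starts_rtl(label):
    # Simplified fragment of RFC 5893's Bidi Rule: an RTL label must
    # begin with a character of bidirectional class R or AL. This is
    # only the first condition of the rule, not the whole thing.
    return unicodedata.bidirectional(label[0]) in ("R", "AL")

print(starts_rtl("עברית"))  # Hebrew letters have bidi class R → True
print(starts_rtl("latin"))  # Latin letters have bidi class L → False
```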


Language serialization is tricky:

  Examples of Issues Found with RFC 3454

  4.1.  Dhivehi

   Dhivehi, the official language of the Maldives, is written with the
   Thaana script.  This script displays some of the characteristics of
   the Arabic script, including its directional properties, and the
   indication of vowels by the diacritical marking of consonantal base
   characters.  This marking is obligatory, and both two consecutive
   vowels and syllable-final consonants are indicated with unvoiced
   combining marks.  Every Dhivehi word therefore ends with a combining
   mark.


Why would that be a problem?


I think it is anglocentric and kind of outrageous; it's just so deep-rooted that most efforts are at best useless, and every past attempt at correcting it has basically been a pure annoyance. The assumptions that translation is a word-for-word dictionary lookup and that a programming language is US English are just annoyances.

Perhaps with the current level of awareness of multilingualism in computer science/engineering, this whole discussion about anglocentrism in computing and possible solutions is too early to take place; very few are even aware, much less accept, that the spoken language active during decision making affects its result for multilingual people.


Yeah no. English is the lingua franca of computer science and programming. It's one of the few areas where language is not getting in the way of international exchange so much. It brings people together. I can check out and understand any random project on github no matter where the author is from and what language they speak. The worst that can happen is I can't read the comments. Even then most people still at least use English names for variables and functions. Nobody stops you from thinking in your native language while coding.

What do you gain from IDN domain names other than that nobody who doesn't speak your language can even type or remember them? Apart from the obvious security issues mentioned in TFA. Same goes for handles or IDs. Don't allow non-ASCII for usernames, to avoid scams. If you insist, have a separate display name that's displayed alongside.


> What do you gain from IDN domain names other than that nobody who doesn't speak your language can even type or remember them?

That people who do speak the language can type them?


Oh right, before those IDN domains existed, people were just sitting there in front of their computers unable to visit any website, since they couldn't type them out.

Every keyboard layout in existence for languages not based on the Latin alphabet has a trivial way to switch to Latin input.


Practically that only works for European languages with extremely limited homographs and no IME prerequisites.


I appreciate every project I come across that has contributors lists using Unicode, or documentation or other resources.

It's a reminder that we live in a shared world.

And if we start seeing more keywords in alternative scripts...

That means more improvements in strong typing, IDEs, LSPs, dev containers, static analysis and a slew of related technologies. It's exhausting, but it seems a little fairer than the URL debate. Of course there will be human auditing and approval advancements there that are exploitable. But it is what it is.


I've used at least one programming language whose keywords and supporting library function names were not in English. There exists more than one.


Programming languages are for programmers (and there are some non-English languages for education, and you can create any language you like), but domain names are for everyone, so it's not a good example.


Alright, then let’s switch to Egyptian hieroglyphs for domain names and the other critical technologies you identify.


On the contrary, this is going to make URLs hard to use and would drive them to irrelevance in some countries if IDN usage becomes more common. Take CJK languages, for instance. There are just so many characters that you are required to use predictive typing for every piece of text. This means it's impossible to type CJK URLs with deterministic keypresses, as you're constantly choosing suggestions, which get reordered all the time.

As for the homograph problem mentioned in the article, that's not anglo-centric fluff. It's an actual major problem that affects speakers of all languages. It hinders people's ability to reliably type or recognize domain names. That's going to cost some people their life savings.

Overall, IDNs would cause more harm than good by making URLs unreliable. It would just serve to empower search engines.


"As much as possible" can stop short of DNS names. My native language doesn't use a Latin (or Latin-based) script, and I think it is very easy to learn to type and read (but not pronounce) names written in Latin script. It doesn't require learning English as a language.

Also with a push from Google (and Firefox too) most users no longer type URLs - they enter a search query and go to a site from search result. I don't like this trend but site names become increasingly invisible for users.


Personally, I have disabled IDN (enabled show-punycode)... If there was a way to just whitelist the 3 extra characters in my language, I might allow it.

Also, Firefox sadly has all individual countries' TLDs enabled by default if I enable the whitelist. It should be the opposite.


Romans want to be able to dial a phone number without being forced to translate to Arabic numerals.


I kind of agree with that and would hope to be able to revisit it, but no earlier than a century and a half from now. We are not there yet by at least that much.


Homograph attacks are still a problem without Unicode. The entire concept of "it looks like the right text to the user" is just a horrible security concept out of the gate. You want warnings on nearly identical-looking tokens regardless of the reason, even if it's not security related.

As far as input devices go, have you used an IME hands-on before? Non-Latin characters aren't really any more "additional tools or configuration" than selecting "English" as your language during install. Others may select something like "中文(中华人民共和国)" and now, without additional tools or configuration, they are entering via pinyin instead of qwerty, even though the same physical keyboard symbols are being pressed. These IMEs don't always have a 1:1 Latin-character-to-Unicode-character mapping you can rely on either; e.g. for the prior example, on modern Windows the same set of Latin characters can produce different outputs, in a similar way to how English autocompletion suggestions on a mobile device work.


> Homograph attacks are still a problem without unicode.

Sure, but with ASCII there are just two or three sets of homographs ("l1I|", "O0" mostly), whereas with Unicode there are potentially thousands of confusable characters, and more are constantly being added.

> As far as input devices have you used an IME hands on before? Non-latin characters aren't really any more "additional tools or configuration" than when during install you select "English" as your language.

This ignores the millions and millions of devices still in use (and not going anywhere) that don't have "input methods" or anything like it.

Try setting up an IME on a computer running DOS (which is still everywhere in many public offices), on an industrial machine with a basic keyboard, on an aircraft avionics system, etc.

Basic English Latin is the common denominator. That won't change until all those "long tail" devices are gone.


There are indeed more characters but the entire class of additions can be boiled down to "check for characters from mixed scripts in the same token" not "compare n^2/2 characters for similarities" at which point you're back to the original homograph/homoglyph problem.
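That mixed-script check is cheap to sketch. Here's a rough heuristic that uses the first word of each character's Unicode name as a stand-in for its script; a real implementation should use the Unicode Script property and the UTS #39 confusable data instead of this shortcut:

```python
import unicodedata

def scripts_in(token):
    # Crude script heuristic: take the first word of each character's
    # Unicode name (LATIN, GREEK, CYRILLIC, ...). Real tools should use
    # the Unicode Script property and UTS #39 instead.
    scripts = set()
    for ch in token:
        first = unicodedata.name(ch, "UNKNOWN").split()[0]
        if first in {"LATIN", "GREEK", "CYRILLIC", "ARABIC", "HEBREW",
                     "HANGUL", "CJK", "THAI", "ARMENIAN", "GEORGIAN"}:
            scripts.add(first)
    return scripts

def looks_mixed(token):
    return len(scripts_in(token)) > 1

print(looks_mixed("root"))            # pure Latin → False
print(looks_mixed("r\u03bf\u03bft"))  # Latin r/t + Greek omicrons → True
```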

Regarding the millions of pre-Unicode devices, the exact same can be said about public machines running DOS, industrial machines with basic keyboards, aircraft avionics systems, etc. of non-English countries too. After all, the rest of the world didn't just use Latin text or avoid computers until Unicode finally appeared and was supported. Many did, many didn't - the systems and encodings being so painful is what resulted in Unicode.

The whole point of using Unicode for base things like programming languages, domains, etc. is so everyone from all languages (including English) can take advantage of the tens of billions of devices that do support Unicode, instead of being limited to the long tail of millions of devices until every last one is turned off. Support for these legacy use cases was designed into Unicode, and it's why ASCII maps to the first 128 code points (and why UTF-8 encodes them as single bytes); Unicode is just smart enough not to require everyone to limit themselves that way on every device because some 30-year-old machine in a warehouse needs it.


> Others may select something like "中文(中华人民共和国)" and now without additional tools or configuration they are entering via pinyin instead of qwerty even though it's the same physical keyboard symbols being pressed.

That doesn't work; without additional configuration you'll just end up entering the same ascii that matches the keys you press.

You have to enable the IME. (In the example you appear to be suggesting, you do that by hitting shift.) You can't be in Chinese input mode all the time, because that would make it impossible to type non-Chinese text.


Shift will switch modes, but whether you need to shift to get out of Latin or vice versa depends on the context. A legacy app will almost always default to "Latin, shift to pinyin" while a modern app will almost always default to "pinyin, shift to Latin", with exceptions when the app is aware that the other input makes more sense in context. Examples of each, respectively (on current Windows 11 at least, not sure about others), are File Explorer and the Start Menu.

Regardless I don't consider this additional configuration anyways in the same way I don't consider holding shift to get capitals changing your input configuration but maybe that's just me.


> I don't consider this additional configuration anyways in the same way I don't consider holding shift to get capitals changing your input configuration

That's a command; hitting shift by itself to determine your input mode would be better analogized to capslock.


Personally I think the Unicode-code-points-in-identifiers/source-code drama is silly.

The attack is someone you don't trust modifies your source code, using unicode to make the modifications more subtle. What type of threat model is that? There's tons of ways for a malicious party to sabotage source code. Most are probably more subtle than unicode hacks. It is like complaining about how the person you gave a key to your house to might rob you.

I do think that in a code editor already doing syntax highlighting, individual tokens should be bidi-isolated so you can't have RLO and friends affecting too much, but beyond that I really think this whole issue is way overhyped.


> There's tons of ways for a malicious party to sabotage source code. Most are probably more subtle than unicode hacks. It is like complaining about how the person you gave a key to your house to might rob you.

That's not at all how the open-source world works. Many people unknown to the project maintainer can create pull requests and try to sneak in a backdoor through, say, a homograph attack.

    if ( id == "root" && some_other_condition )
A pull request comes in:

    if ( id == "rοοt" && some_other_condition && some_even_better_condition )
I totally vet that! Beautiful added security measure, that "some_even_better_condition" is the nuts, nothing can go wrong: it's an added protection! I merge that ASAP and release!

Except your diff tool missed that on that same line "root" was changed to "rοοt" (and the attacker created an account named "rοοt" on your website).

Game over.

The point is not to bitch about my pseudo-code or how account creation works: the point is that it's easy to miss a homograph attack during a review.
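One cheap defense at review time is roughly what VS Code now does: flag every non-ASCII code point in a line so the reviewer sees it. A toy sketch (flag_non_ascii is a hypothetical helper, not any real tool's API):

```python
# Flag every non-ASCII character in a source line with its code point,
# so a reviewer can see that "rοοt" is not "root".
def flag_non_ascii(line):
    return [(i, ch, f"U+{ord(ch):04X}")
            for i, ch in enumerate(line) if ord(ch) > 127]

suspicious = flag_non_ascii('if ( id == "r\u03bf\u03bft" && cond )')
for pos, ch, cp in suspicious:
    print(pos, ch, cp)   # two Greek omicrons hiding in "root"
```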


> Except your diff tool missed that on that same line "root" was changed to "rοοt"

I feel like many diff tools highlight character changes besides line changes. This is especially helpful when someone makes changes to a long line.


Not to be too pedantic but this kind of comparison is poor form in the first place.

It would sidestep the entire issue if the code properly checked ACLs, roles, or user IDs.


> There's tons of ways for a malicious party to sabotage source code.

If the malicious party has unfettered write access to your source code, then sure. But if all changes get reviewed by a maintainer before being merged, then there's way fewer ways, and Unicode attacks are a really large portion of them.


The problem is very real, and Unicode certainly made it worse, but you can have the same issue with ASCII. Capital I and lowercase l look the same in many fonts, and ad-hoc mitigations have been made since long before computers (e.g. license plates). And you'd be surprised how few non-techies know subdomain hierarchy rules (facebook.com.foo.bar).

Add stress, dyslexia, poor eye sight etc on top of those and you can see how this becomes a real issue. Humans simply aren’t very good at string equality.

As for mitigations, in ASCII only you could color-code lowercase/uppercase/numbers/symbols differently I guess, enforce a font that renders glyphs distinctly enough (probably monospace), highlight the domain name, etc. But even these solutions only go so far, and they don't naturally translate to all of Unicode or all situations where identifiers are needed.
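The color-coding idea is easy to prototype: classify each character and let the UI style each class differently. A minimal sketch of just the classification step (the class names are made up for illustration):

```python
# Bucket each ASCII character into a display class a UI could color-code,
# so "I1l0O" stops reading as five near-identical strokes.
def char_class(ch):
    if ch.islower():
        return "lower"
    if ch.isupper():
        return "upper"
    if ch.isdigit():
        return "digit"
    return "symbol"

print([char_class(c) for c in "I1l0O"])
# → ['upper', 'digit', 'lower', 'digit', 'upper']
```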


I actually have some firsthand experience with IDN governance (long story short, I and other people opposed a certain kind of expansion in allowed Korean domain names [1]), and I was once told that IDN basically happened because the PRC government threatened to build its own DNS equivalent otherwise. I don't know how much of this is true---the person who said it is supposed to know what's going on, but I can't verify it---but it did sound plausible given the hasty nature of IDN and subsequent measures like LGRs.

[1] https://no-hanja-domain.github.io/ (in Korean)


We don't need to throw the baby out with the bathwater here. We can have non-English domain names without the security risks. We just need to do a better job of indicating to the user that a domain name is written using, say, kanji and not ASCII. Full Unicode support, i.e. allowing mixed character sets in a domain name, should be disallowed, however, as displaying to the user "the first character is kanji, the rest is ASCII" is too cumbersome.


Disallowing mixed scripts in labels is not necessarily workable. Consider that in South Korea it is common to use the English "ing" gerund ending in otherwise Hangul text.

That said, it would be a very good step towards ameliorating the confusables issue in IDNs.


Sincere question: if Russians had first made the modern computer (and therefore the de facto language of computing were Russian and Cyrillic), would you be okay with using only Cyrillic as the sole script?


> if Russians had first made the modern computer (and therefore the de facto language of computing were Russian and Cyrillic)

The first part of this sentence does not imply the second. The world's first programmable computer was in fact created by a German. That doesn't make German the language of computing.

> are you okay with only using Cyrillic as the sole script

No, because Russian/Cyrillic lacks the properties that make English/Latin so useful as a universal language/script:

* Almost all languages have widely used, standardized transliteration systems into a Latin script. That's not true for Cyrillic.

* English is far more widely spoken and read than any language written in Cyrillic.

* The subset of the Latin script used by English is essentially the common denominator of all languages written in Latin. That's an extremely important property and immediately predisposes English for the role it has today.

There are objective arguments for English as a universal language for computing. This isn't just "cultural imperialism" or something. English is not my first language, but if I try to imagine my first language being used in its place, I immediately see dozens of serious issues that would arise, all of which are non-issues with English.


> There are objective arguments for English as a universal language for computing. This isn't just "cultural imperialism" or something. English is not my first language, but if I try to imagine my first language being used in its place, I immediately see dozens of serious issues that would arise, all of which are non-issues with English.

The "objective arguments" you cite are simply hand-waving around the fact that the English (and Americans, who mainly descend from English settlers) speak English. I should remind you that French is still a primary language of diplomacy (even if English looks to have replaced it), and French also has robust transliteration systems for many languages (in some cases arguably better than English's, due to fewer phonological edge cases), mainly because it historically controlled parts of Africa, Asia, and the Pacific, and maintained a serious diplomatic presence in areas the French didn't bother to conquer.

Now, the question is simply: what if Cyrillic were the lingua franca of the computing world? You've never answered it directly. You instead doubled down on the supposed benefits of the standard English language and alphabet, which is not what I'm asking. You might not have an answer here, given that it's hard to imagine that alternate reality, but I should remind you that a lot of people globally use significantly different scripts and don't know any Latin letters. Sure, the Chinese know them, but that doesn't apply to many other people in Asia and Africa.


I think you are doing some backwards reasoning based on the prevalence of English in computing to justify why we would only need the English alphabet.

> * Almost all languages have widely used, standardized transliteration systems into a Latin script. That's not true for Cyrillic.

There may be standard transliteration systems for non-English alphabet languages, but there are often multiple standard transliteration systems (e.g. Russian, Arabic). Also, Chinese transliteration is pretty gobbledygook without diacritics that aren't present in the English alphabet (and kind of gobbledygook even with diacritics).

> * English is far more widely spoken and read than any language written in Cyrillic.

What if the de facto language of computing was Mandarin instead?

> * The subset of the Latin script used by English is essentially the common denominator of all languages written in Latin. That's an extremely important property and immediately predisposes English for the role it has today.

The Latin script used by English is exactly all the letters you need to write English. Many other languages written in Latin script either have more (e.g. Spanish, Norwegian), or fewer (e.g. Italian, Serbian), or have different rules for certain things (e.g. Turkish).


> What if the de facto language of computing was Mandarin instead?

Mandarin isn't "widely" spoken. It's spoken by a large number of people, almost all of whom reside in a single country. This makes it utterly useless as an international standard.

And the Chinese writing system would have been far too complicated to become the foundation of computing. We're talking a couple dozen characters vs. tens of thousands. Only recently did it even become possible to accurately represent the whole gamut of Chinese writing using digital systems.


Sure, I agree that far fewer people would want to use Mandarin as a lingua franca compared to English, but that's not really a computing-specific thing but more a reflection of its cultural dominance in general.

That said, technology follows the needs of people, not the other way round. If we were living in a parallel reality where Mandarin had the cultural cachet of English, I'm positive we would have immediately subjugated computers to deal with Chinese characters ;-)


> Mandarin isn't "widely" spoken. It's spoken by a large number of people, almost all of which reside in a single country. This makes it utterly useless as an international standard.

The norm for international communication is that zero people on either side speak the language. Internal communication anywhere in Achaemenid Persia took place in Imperial Aramaic, which was spoken by some subject peoples in the west of the empire.

International communication between Mitanni (speaking an unnamed Indic language) and Egypt (speaking (Afroasiatic) Middle or Late Egyptian) took place in (Semitic) Akkadian.

International communication between Japan and Korea took place in classical Chinese, spoken by neither side and unrelated to the languages spoken in either. Similarly, international communication between Vietnam and China took place in classical Chinese, spoken by neither side (though closely related to the languages spoken in China).

Treaties between early modern Russia and China could not be concluded in classical Chinese, so they were concluded in Latin, spoken by neither side.

It just isn't an expectation that international communication will use a language that is spoken by any party to the communication. This is for the obvious reason that anyone who wants to join the conversation uses whatever is already in use, even if the language currently in use has been dead for a thousand years.


Overseas Chinese is a thing


> The Latin script used by English is exactly all the letters you need to write English.

Only true in the sense that that is how English spelling is currently defined. If you wanted to come up with a script for English from scratch, it would look nothing like the Latin script. It would not be isomorphic to the Latin script.

Compare this inventory of phonemes in General American English, with 24 consonants and 13-15 vowels, of which only 5 are diphthongs: https://en.wikipedia.org/wiki/General_American_English#Phono...


> Almost all languages have widely used, standardized transliteration systems into a Latin script. That's not true for Cyrillic.

Well, it's not true by any measure. It is not the case that almost all languages have widely used, standardized transliteration systems into a Latinate script.

> The subset of the Latin script used by English is essentially the common denominator of all languages written in Latin.

This isn't true either. Why would an Italian consider the weird English letter 'j' more universal than the weird French letter 'ç'? The French might not mind 'j', but what are they supposed to think about 'w'?


>No, because Russian/Cyrillic lacks the properties that make English/Latin so useful as a universal language/script:

Which properties, in particular?

>Almost all languages have widely used, standardized transliteration systems into a Latin script. That's not true for Cyrillic.

A combination of historical circumstances, and you are confusing cause and effect. You can write English in Cyrillic.

>English is far more widely spoken and read than any language written in Cyrillic.

See above.

>The subset of the Latin script used by English is essentially the common denominator of all languages written in Latin. That's an extremely important property and immediately predisposes English for the role it has today.

Why? It makes English almost the only language you can write properly. That is a bad property.


Yes? Cyrillic would only take a few weeks to get accustomed to.


I really meant that for the GP, but just to put it on the record I'm also answering in the positive.


I remember a radio ad (before IDN, obviously), where a company having an "ö" in their name had to repeatedly tell their customers that their domain is written "oe" instead. I'm sure this issue has cost them some real money.

Domain names are nothing like programming languages.


Yes, it used to happen all the time, and it butchered words right and left. For someone who has no affection for their language that may feel OK, but to me the advantages of IDN outweigh the complexity, without doubt.


This is such an American(TM) view it's incredible. People have the right to have computers work in their native language and alphabet, your input method be damned.


How do you reckon that’s a right? Seems more in the realm of “reasonable desire”. And if you want to work towards fulfilling that desire then great and more power to you, seems like it’s got some big challenges.


Linguistic rights are very common in many countries, e.g. the right to interact with the government in various official or minority languages, or to attend school in that language.

https://en.wikipedia.org/wiki/Linguistic_rights


Because people's computers belong to them, they have the right to use them in their own native language and character set. If we look at the problem from an even larger distance, it's people's right to their culture or self-expression. If these are not reasonable rights people should have then what?

In the end the anglosphere should never be able to dictate the usable charset on the entire internet. That'd be utter nonsense.


I feel like your moral outrage has sort of blinded you, did you read TFA? Talking about it in terms of rights is silly because 1) even if we agreed that it was a right it’s not something that’s being withheld, it is instead a problem no one has yet been able to solve and 2) in terms of rights everyone is already free to program in any language they want, though their choice of encoding may make interoperability with existing systems difficult


I think this thread isn't discussing the article, but the borderline-flamebait comment by p-e-w that "IDNs should have never happened".

I find that suggestion outrageous.


Yeah, I think you’ve identified the disconnect, thank you. I was definitely more zoomed in on the rights bit.

To the topic at hand, why is that outrageous/flame bait? They seem to have pretty sane justifications, whereas most of the rebuttals seem to be focused on presumption of personal details. Not that I expect you to be immediately swayed or anything, but I don’t understand why you wouldn’t categorize that as disagreement rather than flame bait (am I missing some context?)


The purpose of IDNs is to allow people to use the Internet in their local language.

p-e-w's comment is uncompromising and dismissive. It doesn't seek to find solutions, but ridicules the very idea of someone wanting to use a language other than English in source code and domains.

When challenged, they have pushed the conversation towards source code, a tangent they introduced, rather than IDNs which most of the discussion is focussed on.

The HN guidelines say:

> Eschew flamebait. Avoid generic tangents.

> Please don't pick the most provocative thing in an article or post to complain about in the thread.

The tone of the comment could hardly be more provocative.

(A supermarket receipt on my desk just caught my eye. It says "se åbningstider på www.føtex.dk" at the top.)


Just because you personally disagree with it and dismiss all arguments presented based on your feelings doesn't mean it's "flamebait".


(I replied to the sibling comment.)


I will probably write a character-set checker for ASCII tomorrow. I agree Unicode is pretty much a security vulnerability anywhere but the frontend (for non-copied display).

Heck, it might even be a reasonable idea to bake anti-Unicode into the system (kernel) and require elevated access to process it at all, in a sandbox...
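For what it's worth, a checker like that is only a few lines. A minimal sketch in Python (the function name and example string are mine, purely for illustration):

```python
def non_ascii_positions(text):
    """Return (index, char, codepoint) for every non-ASCII character."""
    return [(i, ch, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text) if not ch.isascii()]

# The two 'о' characters below are Cyrillic (U+043E), not Latin
suspicious = non_ascii_positions("gооgle.com")
```

Anything this returns for a string that is supposed to be a hostname or identifier is worth flagging.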


The biggest problem with IDN is that it allows mixed-script labels, but even that it kind of has to allow (e.g., in South Korea it's common to append the English "ing" gerund ending to Korean words spelled in Hangul).
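Detecting mixed-script labels can be roughed out in a few lines. A sketch in Python; note the stdlib `unicodedata` module does not expose the Unicode Script property directly, so this approximates it from character names, whereas a real implementation would use the actual Scripts.txt data:

```python
import unicodedata

def is_mixed_script(label):
    """Heuristic: collect the first word of each letter's Unicode name
    (e.g. LATIN, CYRILLIC, HANGUL) and flag labels using more than one."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            name = unicodedata.name(ch, "UNKNOWN")
            scripts.add(name.split(" ")[0])
    return len(scripts) > 1

# The Korean "-ing" pattern mentioned above is a legitimate mixed-script label
mixed = is_mixed_script("먹ing")  # Hangul + Latin
```

This is exactly why a blanket "no mixed scripts" rule is hard: it catches both the attack and the legitimate usage.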


I disable Unicode at every opportunity I get to do so, because I honestly do not need it. Unicode as a "feature", especially one that adds complexity and potentially security risks, should be optional, not mandatory. IMHO.

FWIW, I agree with this viewpoint 100%. Thanks for stating it so clearly.


> The whole Unicode range is at our disposal

Sadly, no. There are actually huge swaths of Unicode disallowed in IDN [0]... although not everyone is aware of this and/or implements this check. Which is a shame, I'd really like to have e.g. "🃏-vd.name" for a homepage.

[0] https://www.iana.org/assignments/idna-tables-12.0.0/idna-tab...


Just in case anyone was losing sleep over the first example, "räksmörgås" in Swedish means "shrimp sandwich".

It's a well-known semi-posh dish [1] (and yes it's an open sandwich since in Sweden that is the default), as well as a fun word since it contains all of our three national characters (åäö) at the same time, thus popular among programmers when dealing with i18n etc.

[1]: https://sv.wikipedia.org/wiki/R%C3%A4ksm%C3%B6rg%C3%A5s


looks like a salad to me (testing to see if this is a flame war vector)


Yeah, those pictures sure make it look like one. But there is a slice of bread underneath there, otherwise it is a salad!

Also, no, I don't think this is a good way to provoke a flame war with Swedes. I can barely register an emotion after seeing this.


This is the kind of mess that you end up with if you don't take internationalisation into account when you first design your standard. The impact of such an attack would've been significantly reduced if other countries' writing systems had been considered when DNS first hit the public.

Punycode is a bad compromise, but nothing better is available. At this point entire TLDs rely on it; it cannot be phased out any more. Hopefully this will be a lesson the next time someone invents a protocol (though I doubt it will be, judging by the comments of some anglophone commenters here).


Besides u/ipython's comment that Unicode did not exist when DNS was designed, consider too that Punycode is not the problem here. The problem is the surfeit of homographs (similar-looking glyphs with different codepoints and semantics) in Unicode. Neither Punycode nor IDNA itself is the issue.

Punycode is complexity, for sure, and it could perhaps have been avoided by just decreeing that non-ASCII labels are UTF-8 on the wire in DNS. This was considered, and you can imagine the lengthy threads that that topic produced then and still occasionally produces now at the IETF!
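The two wire formats under discussion can be compared with Python's stdlib codecs (the built-in `idna` codec implements the older IDNA 2003 rules, but for this particular label the result is the same as with IDNA 2008):

```python
# Raw Punycode vs. the ACE ("xn--") form actually used in DNS
label = "räksmörgås"

print(label.encode("punycode"))  # b'rksmrgs-5wao1o'  (raw Punycode)
print(label.encode("idna"))      # b'xn--rksmrgs-5wao1o'  (ACE form on the wire)
```

The UTF-8-on-the-wire alternative would have avoided this extra encoding layer entirely, at the cost of breaking assumptions in deployed resolvers.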


Since DNS was first defined in November 1987, well before Unicode 1.0 was ratified (1991 at the earliest), what would be your suggestion to the original implementers?


Total aside, and I have absolutely no solution for this, but: Perhaps the meta-meta-meta-problem here is that our industry is seemingly fundamentally unable to let go of flawed solutions from decades ago. Of which we have a lot -- by necessity. It's pretty much impossible to know the important flaws before something has been established and used widely enough, after which point replacing them becomes a herculean task because they have been calcified into the bedrock of our infrastructure.

There are so many examples of this:

  - DNS (for a lot of other reasons than domain names)
  - IPv4
  - TCP (see the motivation for QUIC)
  - a *lot* of incredibly obscure text encodings
  - IBM 3270 control codes
  - Sometimes even just ancient code no one understands anymore (search for the source code of Plan 9's `troff` for an example)
  - …and so on and so forth
…and if someone wants to tackle such a situation, usually the outcome is either:

  - build another hack to add to the pile of hacks we won't be able to get rid of
  - XKCD 927 (n+1 competing standards, because the old one will live on for ever anyway)
  - face second system syndrome (resulting in an overcomplicated mess)
  - get no traction because not enough people will switch to your new and shiny solution (due to the other calcified things in the stack or new flaws with the new solution)
So, my question: How should our civilization deal with this?


Even before Unicode, internationalisation was a thing. They couldn't have used Unicode itself, but they could've come up with some kind of mechanism to deal with non-English encodings.


Time to revisit the classics from the great minds in our industry, who foresaw all these SNAFUs...

Bruce Schneier, more than 20 years ago (a short read, I highly recommend it):

Security Risks of Unicode

I don’t know if anyone has considered the security implications of this.

Unicode is just too complex to ever be secure.

https://www.schneier.com/crypto-gram/archives/2000/0715.html


Computers are made to do things. The most secure computer is an unplugged computer, but that would be a useless computer.

What is the alternative here? Not supporting non-ASCII chars?


Not supporting non-ASCII chars in domain names.


Perhaps a good trade-off would have been that non-ASCII is allowed only in country-code TLDs, with those country operators limiting registrations to characters in their own alphabet.

E.g. for Finland, .fi could do fine with just allowing ASCII + ä, ö, å.
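A default-deny rule like that is trivial to implement. A sketch in Python, with the allowed set being my reading of the hypothetical .fi policy above:

```python
import string

# Hypothetical .fi-style policy: ASCII letters, digits, hyphen, plus ä ö å
ALLOWED = set(string.ascii_lowercase + string.digits + "-" + "äöå")

def label_ok(label):
    """Default deny: every character must be on the allow list."""
    return all(ch in ALLOWED for ch in label.lower())
```

With this, `label_ok("hämeenlinna")` passes, while a label smuggling in a Cyrillic а fails.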


That is exactly what .fi and many other ccTLD registries do: https://www.traficom.fi/en/communications/fi-domains/native-...


But I don't think all of them do that. What if browsers, etc. were updated so that IDNs under TLDs where that rule isn't strictly enforced were always rendered in punycode?


Firefox already does this, with some improvements.

https://wiki.mozilla.org/IDN_Display_Algorithm


That's a good start, but IMO it's still too lax.


DNS registries are free to come up with rules like that, FYI. (And they should.)


How would that be represented on the wire? I don’t think restricting it in that way makes sense.


The wire format could be the same. But to prevent e.g. homoglyph attacks, the principle for deciding which non-ASCII characters are allowed would be 'default deny'.


Oh, of course. I retract my previous objection, brain fart on my part.


Having a 1:1 correspondence between characters and code points would have been nice.


I’m assuming you mean a 1:1 correspondence between code points and glyph shapes.

In that case, you won’t have roundtrip conversions with any legacy encoding beyond Latin. For example, every legacy Cyrillic encoding treats the Latin A and the Cyrillic А as different letters. (Lest you try some sort of context-sensitive transform: both are in use as single-letter words, a French verb form and a Russian conjunction, respectively.)

Even if we could deal with that, the Cyrillic letter that is written as д (pronounced [d]) when used in printed Russian is written exactly like a Latin single-storey g when used in Bulgarian (never as a double-storey one). This is a problem for your idea even in vacuum, but every legacy encoding also treats this as a mere font difference.

As a further example, a and ɑ denote different sounds in the IPA (present simultaneously in French), yet there are fonts where the former looks like the latter (this is frequent in italics, but e.g. regular Futura and Andika do that as well). Those fonts are unsuitable for IPA, of course, but that shouldn’t mean a universal encoding must be unsuitable for it as well. Then there’s the Greek α, which is definitely not an a but kind of like an ɑ.

Do you want to distinguish the German Eszett ß (sometimes has a small protrusion on the left due to its origin as a ligature of long S + S/Z) and the Greek β (never has one)? The Greek τ, the Cyrillic т, the Latin m (used as the standard shape for т in Bulgarian), and the Latin(!) m-overbar (used as the standard shape for т in Serbian)?

I guess what I’m getting at is that the equivalence classes under “some language’s writing tradition has a glyph for C1 that is very similar to some other language’s writing tradition glyph for C2” end up much larger than anyone would ever want for general text-on-computers use.


Round tripping through conversions to other codesets was always only temporarily needed, and for most non-Unicode codesets it was always limited to one or a handful of scripts (e.g., you could not round-trip text that included CJK and Cyrillic through any non-Unicode codeset). So I'm not sure that having that feature was all that important. However, early on it was indeed helpful to obtaining adoption to be able to convert with simple lookup tables.


I think we should at least have an agreed-upon subset of Unicode that squashes characters deemed too visually similar into single code points, eliminates the concept of caps vs. lowercase (where the language has it), and removes any lexical control characters, to get an analogue of the "pure" hostname-style charset we enjoy in ASCII, as found in RFC standards, for example. I'm probably naïve about this, but this is how we treat similar problems within specific applications of the ASCII charset.


There certainly are some things that could've been done better, and I guess there are still some things we could do better by avoiding past mistakes.

But I have a strong bad feeling that what's here already is going to remain for the foreseeable future, and the best we're gonna get is better handling rather than large changes to Unicode itself.


Well, we certainly could do better; we'll never be perfect on this. After all, paypa1.com is the same attack and doesn't even use Unicode.


There are valid reasons to have different code points for the same character, though. Russian с and Latin c look identical to me, but fonts might have ligatures or kerning rules that apply to one but not the other.


Depending on how far you want to take this: 1 and l look the same in many fonts.


With regard to the Unicode rendering problems that forced him to use images of text: he should check whether his WordPress is missing a utf8mb4 configuration for MySQL (MySQL's legacy "utf8" charset only covers 3-byte sequences).

Also, perhaps domain names should be validated against the Unicode confusables list, forbidding confusable collisions.

https://util.unicode.org/UnicodeJsps/confusables.jsp
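The confusables data is used via a "skeleton" transform: map every character to a canonical look-alike, then compare the results. A toy illustration in Python, with a hand-picked subset standing in for the real table:

```python
# Tiny illustrative subset of the Unicode confusables data;
# the real table has thousands of entries.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "1": "l",
    "rn": "m",
}

def skeleton(name):
    for src, dst in CONFUSABLES.items():
        name = name.replace(src, dst)
    return name

def collides(a, b):
    """Would registering both a and b allow impersonation?"""
    return a != b and skeleton(a) == skeleton(b)
```

A registry enforcing this would refuse a new label whose skeleton matches an existing registration.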


Many TLDs already forbid mixing scripts, meaning it's often possible to use only Cyrillic or only Latin, not both. Very unusual Unicode, including emoji, is forbidden by the majority.


"Many" and "majority" are clearly not all, so there is clearly still a risk. I know there are quite a few emoji domains out there; tutorials on them pop up on HN from time to time.

Also, I'm gonna bet that there are still confusables within single-script constraints. As a trivial example, the Unicode confusables set considers "rn" and "m" confusable (in certain fonts and sizes). And I bet that's even more common in the broader European set.

Also, it's a bit sad to be unable to, say, mix math symbols and English...

It seems to me that using the confusables set would be stricter (and safer) but also more flexible.


PSA: all Firefox users should go into about:config and set network.IDN_show_punycode to true. This will completely protect you from these kind of attacks, at least in the browser.


Want to hear a 1ame joke?

Look closely, and you'll see it.

Your solution helps. But the better solution being pushed for, which is frankly the correct one, is "out-of-band" certification of domain names. Whether it's a little padlock next to the URL, or locking down the browser completely on non-certified domains. This will of course require more infrastructure, human and technical.

And, ugh, we'll get moral leakage.

Your advice is good. It's helpful as long as you don't fall victim to feeling too secure.

But, for the debate as a whole, I'm exhausted reading through everything like a lawyer. And dealing with it as a crisis instead of a meeting of minds.


> Want to hear a 1ame joke?

On both my computer and phone, your "l" vs "1" swap is obvious even at a quick glance. But "lame" and "lаmе" look exactly the same in almost every font.
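Dumping the codepoints makes the swap visible; in Python, for instance:

```python
import unicodedata

for ch in "lаmе":  # renders like "lame", but two letters are Cyrillic
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+006C LATIN SMALL LETTER L
# U+0430 CYRILLIC SMALL LETTER A
# U+006D LATIN SMALL LETTER M
# U+0435 CYRILLIC SMALL LETTER IE
```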


He does not mention that domain registries will not allow such domain names to be registered.


Some domain registries won't. But unless that "some" ever becomes "all", IDN attacks will still be possible.


That only affects those domains who exist in those registries, on those TLDs. If my domain has a TLD managed by a registry which disallows homoglyphs, my domain is safe from impersonation.


Your domain's registry's policies do not affect labels below your domain.


OK, so how would that work, for an attacker? If an attacker wanted to impersonate foo.com, and the .com registry does not allow homoglyphs, what could an attacker actually do?


See TFA, which shows the use of confusable homographs for forward-slash and question mark.


But if, as I wrote, the registry for .com does not allow homoglyphs, then an adversary cannot register such names.


You clearly did not read the actual article. Here's a sample from it:

  $ curl https://google.com/.curl.se
  curl: (6) Could not resolve host: google.xn--com-qt0a.curl.se
Get it? That `/` after `google.com` isn't an ASCII `/`, and the TLD (se) doesn't have anything to do with this.
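You can peek inside that ACE label with nothing but the stdlib Punycode codec (no IDNA processing, just RFC 3492 decoding):

```python
# Strip the "xn--" ACE prefix and decode the remainder as raw Punycode
ace = "xn--com-qt0a"
decoded = ace.removeprefix("xn--").encode("ascii").decode("punycode")

# The label turns out to be "com" plus one smuggled non-ASCII character
hidden = [f"U+{ord(ch):04X}" for ch in decoded if not ch.isascii()]
```

That one hidden codepoint is what makes the whole label render like "com" followed by a path separator.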


Ah, right: if you could find an unscrupulous TLD, say .evil, one could register xn--com-qt0a.evil and set it up so that it has a subdomain "google", making google.com⁄.evil lead there.

OK, I guess then that IANA should disallow all TLD registries from registering such names.


The registered domain name in the above example is curl.se, and the registry is se. As far as the registry is concerned, the domain name is all ASCII. The evil label is created by the operator of the registered domain, not the registry. The only way the registry can act in this case is to ban the domain if it observes the attack, but it might never see it.


Oh, I see. In that case, I guess it’s up to the browser. Maybe the address bar should not be displayed as a single string, but as separate fields for the different URL components? One field for the protocol, one for the optional user name (for HTTP auth), one for the host name, one for the path, etc.


Yes, it's up to the browser, or the UI anyway. The UI should do something to protect users from this. One nice option: IRIs you enter would be in Unicode, while other IRIs would be displayed in A-label form (punycoded), and there would be a button to switch the window location URI between U-label form and A-label form.


Some of this is more a complaint about Unicode than about IDN. I'm not seeing suggestions for what would have been better. For example, regarding the confusables mappings, the list simply cannot be complete, as Unicode can add new homographs in the future. Of course IDN is crazy, but much of the craziness comes from Unicode.

Still, if the craziness comes from Unicode, one could be forgiven for wishing IDN had never happened and that Unicode in domain names was not a thing.


So are there any mitigations for curl users? Does anyone know if it's possible to disable IDN by default and only enable it when (if ever) it's needed?


Daniel needs to explain why it's enabled by default if it's so "crazy"


Because the people who come up with defaults are only human, so they sometimes make bad decisions, and this is one of those times.


Danny? You here? Were you able to get into USA after your ESTA or whatever got denied?



