
Fun with Unicode in Swift - Razengan
https://tworingsoft.com/blog/2018/12/10/fun-with-unicode-in-swift.html
======
adwn
Unicode in identifiers is an exceptionally bad idea, with a minuscule upside
and horrible downsides:

1) As the article demonstrates, this allows obfuscation contests in easy mode.
Strangely, the author claims:

> _Nor do I think most of these tricks would get by a code review or go
> unnoticed using something as simple as syntax coloring._

How would syntax highlighting catch that? Those are valid identifiers! And
yes, these "tricks" would almost certainly go unnoticed in a code review, at
least when you don't assume that your coworker is maliciously inserting bugs.

2) Even in the absence of malicious intent, Unicode identifiers open the door
to bugs and head-scratchers: different characters looking identical, or the
same character encoded in different ways.

3) ASCII letters (a-z, A-Z), digits (0-9), and underscore are kind of the
lowest common denominator. When you allow Unicode identifiers, you don't
"internationalize" your code, you're doing the exact opposite! If you're using
ä, ö, ü, ß in your API, how is a Canadian supposed to easily use it?
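
Point 2 can be made concrete with a short Python sketch (Python because it is the language used elsewhere in this thread; all variable names are made up for illustration). It shows one character with two encodings, and two distinct characters that render alike:

```python
import unicodedata

# Same character, two encodings:
# precomposed U+00E9 versus "e" followed by combining acute accent U+0301.
composed = "\u00e9"
decomposed = "e\u0301"
same_without_normalizing = composed == decomposed
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed

# Different characters that look alike: Latin "a" versus Cyrillic U+0430.
latin_a = "a"
cyrillic_a = "\u0430"
lookalikes_equal = latin_a == cyrillic_a

# Normalization does not merge these: NFKC leaves the Cyrillic letter as it is.
still_cyrillic = unicodedata.normalize("NFKC", cyrillic_a) == cyrillic_a
```

The first pair compares equal only after normalization; the second pair never does, which is why lookalike characters are a separate problem from encoding variants.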

~~~
ubernostrum
_Unicode in identifiers is an exceptionally bad idea_

Unicode identifiers can be perfectly well-defined, and many languages have
them. The general idea is an identifier can start with any character that has
derived property XID_Start, and the remainder can be any characters that have
derived property XID_Continue.
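
As a rough illustration, Python's built-in `str.isidentifier()` applies rules of this shape (Python's own definition, from PEP 3131, is based on XID_Start and XID_Continue and additionally allows a leading underscore). The candidate names below are made up for the example:

```python
# Hypothetical candidate names; only the first four are valid identifiers.
candidates = ["größe", "变量", "_tmp", "a1", "1abc", "a-b"]

# str.isidentifier() checks the first character against the "start" set
# and every remaining character against the "continue" set.
results = {name: name.isidentifier() for name in candidates}

valid = [name for name, ok in results.items() if ok]
invalid = [name for name, ok in results.items() if not ok]
```

A digit is allowed after the first character but not as the first character, and a hyphen is not allowed anywhere.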

_different characters looking identical, or the same character encoded in
different ways_

Follow the recommendation of UAX #31 and normalize identifiers to NFKC prior
to comparing them. For example, the ligature variants in the article should
not -- if recommendations are followed -- be different identifiers. Here's
some Python (3):

    >>> import unicodedata
    >>> raw = ['vpnTrafficPort', 'vpnTraﬀicPort', 'vpnTrafﬁcPort', 'vpnTraﬃcPort']
    >>> normalized = set(unicodedata.normalize('NFKC', s) for s in raw)
    >>> normalized
    {'vpnTrafficPort'}

So there wouldn't be a "surprise" lurking in Python -- all four strings are
legal identifiers, but all four of them are also the _same_ identifier
(because Python applies normalization).
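
To see the interpreter itself doing this, rather than a manual call to `unicodedata.normalize`, here is a small sketch that runs source text containing the ligature spelling and inspects the resulting name. The variable names `source`, `namespace` and `stored_names` are only for illustration:

```python
# The source text spells the name with the "ffi" ligature (U+FB03).
source = "vpnTra\ufb03cPort = 443"

namespace = {}
exec(source, namespace)

# Python NFKC-normalizes identifiers while parsing, so the binding ends up
# stored under the plain-ASCII spelling. exec() also adds "__builtins__",
# which is filtered out here.
stored_names = [name for name in namespace if not name.startswith("__")]
```

After running this, the only user-defined name in `namespace` is the ASCII `vpnTrafficPort`.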

The real difficulty with Unicode identifiers is in places where you really
can't avoid Unicode: user inputs. Those who don't read Unicode technical
reports are doomed to suffer the moment they build, say, a user-account system
that comes into contact with the real world.
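
A minimal sketch of what that looks like for usernames, assuming NFKC plus case folding is an acceptable comparison key. That is a simplification, not a complete identifier-matching scheme, and `canonical_username` and `UserRegistry` are hypothetical names invented for this example:

```python
import unicodedata


def canonical_username(name: str) -> str:
    """Hypothetical helper: fold compatibility variants and case before comparing."""
    # Simplifying assumption: NFKC followed by casefold() is a good enough key.
    return unicodedata.normalize("NFKC", name).casefold()


class UserRegistry:
    """Hypothetical in-memory registry keyed by the canonical form of each name."""

    def __init__(self):
        self._taken = {}

    def register(self, name: str) -> bool:
        """Return True if the name was free, False if it collides with an existing one."""
        key = canonical_username(name)
        if key in self._taken:
            return False
        self._taken[key] = name
        return True
```

With this key, `"ADMIN"` and a fullwidth-letter spelling of `"admin"` collide with `"admin"`, but a spelling that swaps in a Cyrillic lookalike letter is still accepted as a new account, so normalization alone does not stop impersonation.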

_When you allow Unicode identifiers, you don't "internationalize" your code_

No, you let other people _localize their code_. We (the tech world) spent
decades making everyone else learn English as a prerequisite to learning a
programming language. Now we have the ability to lessen that burden and let
the non-ASCII world (which does in fact include English!) spend less time
learning English and more time writing code. We probably should do that, even
if it seems icky to you.

~~~
continuational
So instead of having to learn English as a prerequisite to understand code
bases, you propose that we learn a dozen or more different natural languages
instead?

Can you imagine what it would be like if every library had to be released in N
different translations? Sounds like an exceptionally bad idea. A huge step
back for our field.

~~~
ubernostrum
If you take the position that every codebase everywhere must be available to
and understood by every programmer everywhere, sure, you have to settle on a
common language. Luckily, I don't think anybody takes that position, or at
least takes it seriously.

But there are plenty of monolingual dev teams out there whose shared language
isn't English and isn't written in ASCII. Why should they be forced to write
code in English if nobody who works with it is an English speaker?

I already regularly see people post questions to Stack Overflow and mailing
lists and IRC where names of classes and functions and variables are ASCII but
not English (i.e., someone writes a class called "Utilisateur", not "User").
Why not let them just use their language fully?

------
hprotagonist
“Zalgo-style” function names for dangerous or weird functions are kind of a
cute idea.

In dense scientific code there’s always been a place for special characters.
(APL, anyone?) To the right kind of programmer, using the proper mathematical
notation for operations and variables in a function can greatly improve
readability and clarity.

Unicode for use in theorem-proving languages also has a role — Agda comes to
mind here.

------
rurban
I made a similar post when I actually fixed Unicode identifiers for cperl (the
perl5 replacement), according to the Unicode security recommendations, which
everybody besides me and Java chose to ignore. Most languages are
horribly insecure. But I eventually complained to Rust, and they are adopting
these recommendations.
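
One check commonly associated with Unicode security guidance (UTS #39) is flagging identifiers that mix scripts. The sketch below is only a rough approximation of that idea: Python's standard library does not expose the Unicode Script property, so it assumes the first word of each character's Unicode name can stand in for the script. The function names are hypothetical, and this is not what cperl or any other language actually implements:

```python
import unicodedata


def rough_scripts(identifier: str) -> set:
    """Approximate the set of scripts used by the letters in an identifier.

    Assumption: the first word of a character's Unicode name ("LATIN",
    "CYRILLIC", "GREEK", ...) is a usable proxy for its script. This is a
    crude heuristic, not the real Script property.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in identifier
        if ch.isalpha()
    }


def looks_mixed_script(identifier: str) -> bool:
    """Flag identifiers whose letters appear to come from more than one script."""
    return len(rough_scripts(identifier)) > 1
```

An all-Latin name such as `größe` passes, while `payload` spelled with a Cyrillic "а" in place of the Latin "a" is flagged.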

[http://perl11.org/blog/unicode-identifiers.html](http://perl11.org/blog/unicode-identifiers.html)

It's not fun, it's a risk.

------
dep_b
Perhaps non-ASCII characters in anything but strings should simply be banned
by Swiftlint.
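
For what such a rule amounts to, here is a sketch of the same idea applied to Python source with the standard `tokenize` module. It is an analogy, not a SwiftLint rule, and `non_ascii_identifiers` is a hypothetical helper name:

```python
import io
import tokenize


def non_ascii_identifiers(source: str):
    """Return (line, name) pairs for identifiers that contain non-ASCII characters."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers and keywords. String literals arrive
        # as STRING tokens and are deliberately left alone.
        if tok.type == tokenize.NAME and not tok.string.isascii():
            hits.append((tok.start[0], tok.string))
    return hits


# Made-up sample: the identifier on line 1 is flagged, the string on line 2 is not.
sample = 'größe = 1\nlabel = "größe"\n'
findings = non_ascii_identifiers(sample)
```

Because the check works on tokens rather than raw text, non-ASCII characters inside string literals stay allowed, which matches the "anything but strings" wording above.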

------
jerrre
I wonder how much autocomplete could save you here.

