
UTF-8 Everywhere (2012) - thefox
http://utf8everywhere.org/
======
Animats
The Python problem is amusing. Python 3 has three representations of strings
internally (1-byte, 2-byte, and 4-byte) and promotes them to a wider form when
necessary. This is mostly to support string indexing. It probably would have
been better to use UTF-8, and create an index array for the string when
necessary.

You rarely need to index a string with an integer in Python. FOR loops don't
need to. Regular expressions don't need to. Operations that return a position
into the string could return an opaque type which acts as a string index. That
type should support adding and subtracting integers (at least +1 and -1) by
progressing through the string. That would take care of most of the use cases.
Attempts to index a string with an int would generate index arrays internally.
(Or, for short strings, just start at the beginning every time and count.)
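
A minimal sketch of that opaque index idea, written in C++ rather than
CPython internals (the class name and its methods are mine, purely for
illustration): it stores a byte offset but can only move by whole code
points, leaning on the fact that UTF-8 continuation bytes always match
10xxxxxx.

    
    
        #include <cstddef>
        #include <string>
        
        // Hypothetical opaque index into a UTF-8 string.
        class Utf8Index {
          const std::string* s_;
          std::size_t byte_;  // invariant: always at a code point boundary
        public:
          explicit Utf8Index(const std::string& s) : s_(&s), byte_(0) {}
          Utf8Index& operator++() {  // the "+1" case
            do { ++byte_; } while (byte_ < s_->size() &&
                (static_cast<unsigned char>((*s_)[byte_]) & 0xC0) == 0x80);
            return *this;
          }
          Utf8Index& operator--() {  // the "-1" case
            do { --byte_; } while (byte_ > 0 &&
                (static_cast<unsigned char>((*s_)[byte_]) & 0xC0) == 0x80);
            return *this;
          }
          std::size_t byte_offset() const { return byte_; }
        };
    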

Windows and Java have big problems. They really are 16-bit char based. It's
not Java's fault; they standardized when Unicode was 16 bits.

~~~
johncolanduoni
I think it's even better to take this one step further and have your default
"character" actually be a grapheme[1]. In almost any case where you're dealing
with individual character boundaries you want to split things on the grapheme
level, not the code-point level.

This doesn't matter much for (normalized) western European text, but if the
language in question needs to use separate diacritical code points you'll
likely end up with hanging accents and the like. Swift is the only language I
know of that has grapheme clusters as the default unit of character; I'd love
to see it in more places.

[1]:
[http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries](http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries)

~~~
MrBuddyCasino
Never understood that either. Why is this so rare? Even in technical
discussions like this one, some people will look at you funny upon hearing
this suggestion.

~~~
jcranmer
Navigating a UTF-8 string on codepoint level is a fairly simple algorithm,
since UTF-8 is self-synchronizing. This means it can easily be done without
relying on external libraries or data files. It's also stable with respect to
Unicode version--it always produces the same result independent of what
version of the Unicode tables you use.

Moving to grapheme cluster boundaries means that the algorithm may work
incorrectly if you feed a string containing Unicode N+1 characters to an
implementation that only supports Unicode N. It also makes the "increment
character" function very complicated. In the UTF-8 version, this looks roughly
like:

    
    
        #include <stdint.h>  /* uint8_t; __builtin_clz is a GCC/Clang builtin */
        
        char *advance(char *str) {
          uint8_t c = (uint8_t)*str;
          /* Count the number of leading 1's. Cast ~c back to 8 bits:
             on the promoted int, the upper 24 bits of ~c are all ones,
             so clz would always return 0. (For 0xFF this hits clz(0),
             which is undefined, but 0xFF is never valid UTF-8.) */
          int num1s = __builtin_clz((uint8_t)~c) - 24;
          if (num1s <= 1) return str + 1;  /* ASCII, or a stray continuation byte */
          return str + num1s;
        }
    

Grapheme-based indexing looks like this:

    
    
        char *advance_grapheme(char *str) {
          while (true) {
            uint32_t codepoint = read_codepoint(str);
            str = advance(str);
            uint32_t nextCodepoint = read_codepoint(str);
            /* This is typically something like
               table[table2[codepoint >> 4] * 16 + (codepoint & 15)]; */
            GraphemeClusterBreak left = lookupProp(codepoint);
            GraphemeClusterBreak right = lookupProp(nextCodepoint);
            /* Several rules based on left versus right decide whether
               this is a boundary (is_break_between stands in for them): */
            if (is_break_between(left, right))
              break;
          }
          return str;
        }
    

See the vast difference in the two implementations? It's a lot of complexity,
and it's worth asking if that complexity needs to be built into the main
library (strings are a fundamental datatype in any language). It's also
important to note that it's questionable whether such a feature implemented by
default is going to actually fix naive programmers' code--if you read UTR #29
carefully, you'll notice that something like क्ष will consist of two grapheme
clusters (क् and ष), which is arguably incorrect. Internationalization is
often tied heavily to the GUI and, especially for problems like grapheme
clusters, it arguably makes more sense for toolkits to implement and deal with
the problems themselves, providing primitives like a "text input widget",
rather than encouraging programmers to try to implement it themselves.

~~~
mtviewdave
History has shown that, when it comes to strings, developers have a hard time
getting even something as simple as null-termination correct. If grapheme
handling is complex, that's an argument for having it implemented by a small
team of experts exactly once. The resulting abstraction might not be leak-
proof, but then no abstraction is.

------
wcoenen
It's interesting how history seems to have repeated itself with UTF-16. With
ASCII and its extensions, we had 128 "normal" characters and everything else
was exotic text that caused problems.

Now with UTF-16, the "normal" characters are the ones in the Basic
Multilingual Plane that fit in a single UTF-16 code unit.

~~~
mark-r
It's worse. With UTF-8, if you're not processing it properly it becomes
obvious very quickly with the first accented character you encounter. With
UTF-16 you probably won't notice any bugs until someone throws an emoticon at
you.
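
To make that concrete, a small C++ sketch (the literals are my own examples):
a BMP character is one UTF-16 code unit, while an emoji needs a surrogate
pair, so naive length counts quietly diverge.

    
    
        #include <iostream>
        
        int main() {
          char16_t bmp[]   = u"é";   // U+00E9 is in the BMP: 1 code unit
          char16_t emoji[] = u"😀";  // U+1F600 is not: a surrogate pair
          std::cout << sizeof(bmp) / sizeof(char16_t) - 1 << "\n";    // 1
          std::cout << sizeof(emoji) / sizeof(char16_t) - 1 << "\n";  // 2
        }
    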

~~~
ridiculous_fish
Unfortunately not. It's easy to process UTF-8 such that you mishandle certain
ill-formed sequences that you are unlikely to encounter accidentally. IIS was
hit [1], Apache Tomcat was hit [2], PHP was hit twice [3] [4].

UTF-16 has its own warts, but invalid code units and non-shortest forms are
exclusive to UTF-8.

[1] [http://www.sans.org/security-resources/malwarefaq/wnt-
unicod...](http://www.sans.org/security-resources/malwarefaq/wnt-unicode.php)

[2] [http://cve.mitre.org/cgi-
bin/cvename.cgi?name=CVE-2008-2938](http://cve.mitre.org/cgi-
bin/cvename.cgi?name=CVE-2008-2938)

[3]
[https://www.cvedetails.com/cve/CVE-2009-5016/](https://www.cvedetails.com/cve/CVE-2009-5016/)

[4]
[https://www.cvedetails.com/cve/CVE-2010-3870/](https://www.cvedetails.com/cve/CVE-2010-3870/)
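
The classic failure mode is an overlong sequence such as 0xC0 0xAF, which a
sloppy decoder turns into '/' even though UTF-8 forbids that form. A hedged
sketch of the minimum-value check a strict decoder applies (boundaries per
RFC 3629):

    
    
        #include <cstdint>
        
        // A decoded code point must have needed every byte that carried
        // it; otherwise the sequence is an overlong (non-shortest) form.
        bool is_shortest_form(uint32_t cp, int seq_len) {
          switch (seq_len) {
            case 1: return cp <= 0x7F;
            case 2: return cp >= 0x80;    // rejects 0xC0 0xAF -> '/'
            case 3: return cp >= 0x800;
            case 4: return cp >= 0x10000 && cp <= 0x10FFFF;
            default: return false;
          }
        }
    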

------
chillacy
This article is from 4 years ago. Since then, UTF-8 adoption has increased
from 68% to 87% of the top 10 million websites on Alexa:

[https://w3techs.com/technologies/history_overview/character_...](https://w3techs.com/technologies/history_overview/character_encoding/ms/y)

~~~
Const-me
_Unicode_ adoption increased to 87%. At the cost of non-Unicode encodings.

UTF16 isn’t good enough for the web: even for content in Ukrainian or Hebrew,
UTF8 saves sizeable bandwidth, because spaces, punctuation marks, newlines,
digits, and English-inspired HTML tags all encode as 1 byte per character in
UTF8, and for the web, bandwidth matters.
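
The difference is easy to measure with two string literals (a C++ sketch; the
markup is my own example, and it assumes a UTF-8 execution charset, the
default on most toolchains):

    
    
        #include <iostream>
        
        int main() {
          // The 9 Cyrillic letters cost 2 bytes in either encoding; the
          // tags, spaces and punctuation cost 1 byte in UTF8 but 2 in UTF16.
          const char     utf8[]  = "<p>Привет, мир!</p>";
          const char16_t utf16[] = u"<p>Привет, мир!</p>";
          std::cout << sizeof(utf8) - 1 << " bytes as UTF-8\n";             // 28
          std::cout << sizeof(utf16) - sizeof(char16_t) << " as UTF-16\n";  // 38
        }
    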

~~~
chillacy
> _Unicode_ adoption increased to 87%. At the cost of non-Unicode encodings.

Am I reading that site incorrectly? It says UTF-8: 87.2%, not Unicode.

Then down below:

" The following character encodings are used by less than 0.1% of the
websites"

UTF-16

[https://w3techs.com/technologies/overview/character_encoding...](https://w3techs.com/technologies/overview/character_encoding/all)

------
IvanK_net
When you create a table in MySQL, a text attribute (VARCHAR etc.) is not
encoded in UTF8 by default.

I think UTF8 should be the default and only format for storing text attributes
in all databases and all other text encodings should be removed from database
systems.

~~~
zeta0134
We can't even convince Microsoft, Apple, and everything Unix-based to agree on
line endings. How on earth are we going to convince everyone that one
character encoding format is the only way they should store their data?

Annoying as it is to deal with, our history as computer scientists demands
that we maintain compatibility with older systems and encoding formats that
were once used but are now almost forgotten. If we removed all the other
encoding formats (code paths that, while underused, still function perfectly
fine) we would lose the ability to parse and manipulate a lot of old data.

~~~
Yaggo
The universal line ending character is \n. (Except in Microsoft's universe,
but that's never been compatible with the rest.)

~~~
skrause
Not true; plain-text protocols like SMTP, FTP, and HTTP/1.1 mandate \r\n.
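
For example, an HTTP/1.1 request (a fragment; RFC 7230 is the relevant spec):

    
    
        // Each line ends in \r\n, and a lone \r\n ends the header block.
        const char request[] =
            "GET / HTTP/1.1\r\n"
            "Host: example.com\r\n"
            "\r\n";
    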

------
yuhong
I have the feeling that back in 1990, ISO 10646 wanted 32-bit characters but
had no software folks on that committee, while the Unicode people were
basically software folks who thought that 16 bits were enough (this dates back
to the original Unicode proposal from 1988). UTF-8 was only created in 1992,
after the software folks rejected the original DIS 10646 in mid-1991.

------
mangix
This seems specific to Windows. UTF8 is already standard on Linux and the web,
for example. It's just Microsoft.

~~~
jfries
An interesting suggestion they make is to keep utf-8 also for strings internal
to your program. That is, instead of decoding utf-8 on input and encoding it
again on output, you just keep it encoded the whole time.

~~~
PeterisP
What would be a good alternative for strings internal to your program?

I work with multilingual text processing applications, and I strongly support
that concept. A guideline of "use UTF8 or die" works well and avoids lots of
headaches: it is the most efficient encoding for in-memory use (unless you
work mostly with Asian charsets, where UTF16 has a size advantage), and it is
compatible with all legal data. So it's quite effective to have a policy that
100% of your functions/API/datastructures/databases pass _only_ UTF8 data, and
when other encodings are needed (e.g. file import/export), the data is
converted to that something else at the very edge of the application.

Having a mix of encodings is a time bomb that sooner or later blows up as
nasty bugs.
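
A sketch of that edge conversion with POSIX iconv (the function name and error
handling are mine; real code would also loop on E2BIG):

    
    
        #include <iconv.h>
        #include <stdexcept>
        #include <string>
        
        // Convert legacy-encoded bytes to UTF8 at the application edge,
        // so everything past this point sees UTF8 only.
        std::string to_utf8(const std::string& in, const char* from_charset) {
          iconv_t cd = iconv_open("UTF-8", from_charset);
          if (cd == (iconv_t)-1) throw std::runtime_error("unknown charset");
          std::string out(in.size() * 4, '\0');  // worst-case growth
          char* src = const_cast<char*>(in.data());
          char* dst = &out[0];
          size_t srcleft = in.size(), dstleft = out.size();
          size_t rc = iconv(cd, &src, &srcleft, &dst, &dstleft);
          iconv_close(cd);
          if (rc == (size_t)-1) throw std::runtime_error("bad input bytes");
          out.resize(out.size() - dstleft);
          return out;
        }
    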

~~~
ridiculous_fish
Abstraction is the alternative. Design an API that treats encodings uniformly,
and the encoding becomes an internal implementation detail. You can then have
a polymorphic representation that avoids unnecessary conversions. NSString and
Swift String both work this way.
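
A toy sketch of the idea in C++ (my own design, not NSString's actual
internals): callers get one uniform operation, and the representation tag
stays private.

    
    
        #include <cstddef>
        #include <string>
        #include <utility>
        
        class PolyString {
          enum class Repr { Ascii, Utf8 };
          Repr repr_;
          std::string bytes_;
          PolyString(Repr r, std::string b) : repr_(r), bytes_(std::move(b)) {}
        public:
          static PolyString ascii(std::string b) { return {Repr::Ascii, std::move(b)}; }
          static PolyString utf8(std::string b)  { return {Repr::Utf8,  std::move(b)}; }
        
          std::size_t code_points() const {
            if (repr_ == Repr::Ascii) return bytes_.size();  // 1 byte each
            std::size_t n = 0;
            for (unsigned char c : bytes_)
              if ((c & 0xC0) != 0x80) ++n;  // skip continuation bytes
            return n;
          }
        };
    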

------
voaie
May be off-topic, I wonder if anyone is planning a redesign of Unicode for the
far future? or is there a better way to handle characters, so we don't require
a giant library like ICU?

~~~
PeterisP
If you want to handle characters by anything much simpler than current
Unicode, you need to simplify the reality that Unicode describes, changing or
eliminating a bunch of major human languages. Not all of them, and not even
most of them, but still hundreds of millions of people would need to change
how they use their language.

It could actually happen in a century or two; we are seeing some language
trends that favor internationalization and simplification over localization
and keeping with linguistic tradition.

~~~
vorg
Simplification (caused by internationalization) and diversification (caused by
localization) are two ends of a spectrum, but languages, in both their spoken
and written forms, have bounced between those ends throughout history. In a
century or two, by the time simplification has succeeded on Earth, the
settlers on Titan will rebel with their own graphical symbols for displaying
language.

------
Const-me
> In both UTF-8 and UTF-16 encodings, code points may take up to 4 bytes.

Wrong: up to 4 bytes in UTF16, and up to 6 bytes in UTF8.

> Cyrillic, Hebrew and several other popular Unicode blocks are 2 bytes both
> in UTF-16 and UTF-8.

Cyrillic, Hebrew and several other languages still have spaces and
punctuation, which take a single byte in UTF8. It's 2016 now: RAM and storage
are cheap and getting cheaper, but a CPU branch misprediction still costs the
same ~20 cycles, and that is not going to decline.

> plain Windows edit control (until Vista)

Windows XP is 14 years old, and now in 2016 its market share is less than 3%.
Who cares what came before Vista?

> In C++, there is no way to return Unicode from std::exception::what() other
> than using UTF-8.

The exceptions that are part of the STL don’t return Unicode at all; they are
in English.

If you throw custom exceptions that return non-English messages from
exception::what() in utf-8, then catch std::exception and call what(), you'll
get English error messages for STL-thrown exceptions and non-English messages
for your custom exceptions.

I’m not sure mixing GUI languages in a single app is always the right thing.

> First, the application must be compiled as Unicode-aware

The oldest Visual Studio I have installed is 2008 (because I sometimes develop
for WinCE). I’ve just created a new C++ console application project, and by
default it is already Unicode-aware.

So, for anyone using Microsoft IDE, this requirement is not a problem.

~~~
d0mine
Modern UTF-8 is limited to 4 bytes (not 6).
[http://stackoverflow.com/questions/9533258/what-is-the-
maxim...](http://stackoverflow.com/questions/9533258/what-is-the-maximum-
number-of-bytes-for-a-utf-8-encoded-character)

I haven't checked your other claims but this stands out:

> The exception that are part of STL don’t return Unicode at all, they are in
> English.

Do you mean they return the text as bytes using some (likely ASCII) character
encoding and all the text characters are in ASCII range?

 _There Ain't No Such Thing As Plain Text._ (2003)
[http://www.joelonsoftware.com/articles/Unicode.html](http://www.joelonsoftware.com/articles/Unicode.html)

~~~
Const-me
> Do you mean they return the text as bytes using some (likely ASCII)
> character encoding and all the text characters are in ASCII range?

If you rely on std::exception::what() while building localizable software,
you'll end up with an inconsistent GUI language, because some exceptions
(those that are part of the STL) will return English messages while other
exceptions (those that aren't) will return non-English messages.

This means if you're developing anything localizable, you can't rely on
std::exception::what().

Then why care about its prototype?

~~~
ybungalobill
The standard does not specify what the standard exceptions return from what().
It does not have to be in English.

Why care about its prototype? You may want to embed into what() Unicode
strings that describe the error and come from elsewhere: e.g. a path, a URL,
an XML element id, etc. from the context in which the exception originated. It
may be shown to the user or written to a log. Localization is irrelevant here.
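
A sketch of that use (the exception class is mine, and it assumes a UTF-8
source encoding, so the path bytes pass through what() untouched):

    
    
        #include <stdexcept>
        #include <string>
        
        // Hypothetical exception carrying UTF-8 context: a path, a URL,
        // an element id... whatever the failure site knew.
        class file_error : public std::runtime_error {
        public:
          explicit file_error(const std::string& utf8_path)
              : std::runtime_error("cannot open: " + utf8_path) {}
        };
        
        // throw file_error("данные/отчёт.xml");
        // catch (const std::exception& e) { log(e.what()); }
    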

------
hackuser
Is there any application where UTF-8 isn't the best choice for long-term
(i.e., 20-200 year) forward compatibility?

~~~
niftich
Places and situations where you can't accommodate variable-length encodings.
As far as future-proofing, UTF-8 is essentially the new ASCII, in that UTF-8
will remain a backward-compatibility goal for any other format that will
succeed it.

~~~
hackuser
> As far as future-proofing, UTF-8 is essentially the new ASCII, in that UTF-8
> will remain a backward-compatibility goal for any other format that will
> succeed it.

Yes, I love that every byte transmitted on the Internet still reserves code
points for controlling teletype (or similar) machines.

------
Murk
After considering this problem at length in the past, I too favoured utf8 at
the time.

I remember a project (circa 1999) I worked on, a feature phone HTML 3.4
browser and email client (one of the first). The browser/IP stack handled
only ASCII/code-page characters to begin with. To my surprise it was decided
to encode text on the platform using utf-16, and thus the entire code base was
converted to use 16-bit code units (UCS-2). On a resource-constrained
platform (~300 KB of RAM, IIRC), it would have been better, I think, to update
the renderer and email client to understand utf8.

Nice as it might be to think of a utf16 or utf32 code unit as a "character",
it is, as has been pointed out, not the case, and when you look into language
you can see how it never can be that simple.

------
misnome
I quite like Swift's approach with Characters, where a character can be "an
extended grapheme cluster ... a sequence of one or more Unicode scalars that
(when combined) produce a single human-readable character." In practice this
seems to mean that things like multi-byte entries and modified entries end up
as a single entry.

As the trade-off, directly indexing into strings is either not possible or
discouraged, and often relies on an opaque(?) indexing class.

The main weirdness I have encountered so far is that the regex functions
operate only on the old Objective-C method of indexing, so a little swizzling
is required to handle things properly.

------
cm3
Offtopic, but does anyone know of a way to ensure I don't introduce non-ASCII
filenames, to ensure broad portability across systems? I've had to resort to
disabling UTF-8 on Linux to achieve that.

~~~
viraptor
What's the use case? Making sure you don't introduce them as a desktop user?
As an app developer (and what does the app do)? As a sysadmin with unknown
third-party apps?

You can't really "disable utf-8" on Linux. You can change how things are
encoded when displaying or saving (via the locale/LANG variables), but if an
app wants to create a file named "0xE2 0x98 0x83" (the binary version, of
course), it's still free to do that.

~~~
cm3
I just don't want garbage file names when sharing a file system between
systems that don't agree on the encoding. I was thinking maybe some mount
option; I can use ISO-8859-1 and skip UTF-8, but I haven't found a mount
option for ext4 or xfs yet.

~~~
viraptor
There isn't one. The names in ext4 and xfs are opaque binary with some simple
limitations (like null bytes). Encodings simply don't exist at the fs layer.

You could probably write some filter using fusefs, but in practice... I think
you should configure the servers / clients to agree on encoding instead.
Better supported and shouldn't be that much work.
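
Failing a mount option, the check can live in userspace; a sketch with C++17
std::filesystem that flags names containing bytes outside printable ASCII:

    
    
        #include <filesystem>
        #include <iostream>
        
        int main(int argc, char** argv) {
          if (argc < 2) return 1;
          namespace fs = std::filesystem;
          for (const auto& e : fs::recursive_directory_iterator(argv[1])) {
            const std::string name = e.path().filename().string();
            for (unsigned char c : name)
              if (c < 0x20 || c > 0x7E) {  // outside printable ASCII
                std::cout << e.path() << "\n";
                break;
              }
          }
        }
    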

------
wrp
This militancy to _force_ everyone to use UTF-8 is bad engineering. I'm
thinking of GNOME 3, where you aren't even allowed the option of choosing
ASCII as a default setting, only UTF-8 or ISO-8859-x. A default setting is
just as important for what it filters out as for what it passes through. I use
a lot of older tools on *nix that are ASCII-only, in tool chains that slurp
and munge text. If the chain includes any of these UTF-8-only apps, I'm
constantly dealing with the problem of invalid ASCII passing through.

------
codeulike
With or without a BOM?

~~~
mcpherrinm
Putting a BOM in UTF-8 is just silly. Unlike -16, there's no option for which
order you put the bytes in. The only time you'll see a BOM in UTF-8 is in
poorly converted UTF-16.
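
And if you do have to consume such files, tolerating the BOM is a three-byte
check; a minimal sketch:

    
    
        #include <cstring>
        #include <string>
        
        // Strip a leading UTF-8 BOM (EF BB BF) if present; it encodes no
        // byte order, only a hint that the file is probably UTF-8.
        std::string without_bom(const std::string& s) {
          if (s.size() >= 3 && std::memcmp(s.data(), "\xEF\xBB\xBF", 3) == 0)
            return s.substr(3);
          return s;
        }
    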

~~~
slavik81
Apparently, PowerShell requires a BOM to recognize UTF-8 scripts.
[https://github.com/chocolatey/choco/wiki/CreatePackages#char...](https://github.com/chocolatey/choco/wiki/CreatePackages#character-
encoding)

------
douche
Some days, I imagine a parallel universe where the ancient Chinese had decided
ideograms were a bad idea and had gone on to develop a proper alphabet.
Unicode would be pretty much unnecessary.

