
UTF-8 Everywhere - angersock
http://www.utf8everywhere.org/
======
jbk
This resonates so much for me, from VLC.

VLC has a very large number of users on Windows (80% of our users), yet almost
none of the devs use Windows to code. Therefore, we use UTF-8 char* everywhere,
notably in the core. We use UTF-16 conversions only in the necessary Windows
modules that use Windows APIs. Being sure we were UTF-8 everywhere took a lot
of time, tbh...

But the worst are formats like ASF (WMV) or MMS that use UTF-16/UCS-2 (without
specifying it correctly) and that we need to support on all the other
platforms, like OS X or Linux...

~~~
username42
I am also convinced that wide chars should be avoided and are one of the bad
things about Java. It is very nice to see that more and more people agree and
that important projects like VLC already apply this principle. Will 2014
become the year we kill wide chars?

------
asgard1024
This may be tangential, but I think that computer languages should have a
different type (and literal notation) for human text (strings that may be read
by humans, may be translated, and won't affect program semantics) and for
computer strings (strings that are strictly defined, not to be translated, and
may affect program semantics).

Then we could put all the human-language problems into the human text type,
and leave the simpler computer string type with easier semantics.

In Python, although there are no tools for that, I typically use the following
convention: single quotes for computer text and double quotes for human text.
I guess you could use byte arrays for computer text as well, but it would be
more painful.

~~~
cynwoody
In Go, you can name your variables in Chinese† if you want.

In Go, strings are immutable UTF-8 byte arrays,†† and the language provides
facilities for iterating over them either byte by byte or rune by rune (a rune
is an int32, wide enough to hold any Unicode character).

†[http://play.golang.org/p/skyEYhxWg3](http://play.golang.org/p/skyEYhxWg3)

††[http://blog.golang.org/strings](http://blog.golang.org/strings)
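
For readers without Go handy, a rough C++ analogue of the byte-vs-rune
distinction (a minimal decoder sketch assuming well-formed UTF-8; not the Go
runtime's actual logic):

    #include <cstdio>
    #include <string>
    
    // Decode one UTF-8 code point starting at s[i]; advance i past it.
    // Sketch only: assumes well-formed input, no error handling.
    static char32_t next_codepoint(const std::string& s, size_t& i) {
        unsigned char b = s[i];
        int extra = (b >= 0xF0) ? 3 : (b >= 0xE0) ? 2 : (b >= 0xC0) ? 1 : 0;
        char32_t cp = b & (0x7F >> extra);                   // strip length bits
        while (extra-- > 0)
            cp = (cp << 6) | ((unsigned char)s[++i] & 0x3F); // continuation bytes
        ++i;
        return cp;
    }
    
    int main() {
        std::string s = "Ca\xC3\xB1yon";    // "Cañyon": 7 bytes, 6 code points
        for (unsigned char b : s)           // byte by byte
            std::printf("%02X ", (unsigned)b);
        std::printf("\n");
        for (size_t i = 0; i < s.size(); )  // code point ("rune") by code point
            std::printf("U+%04X ", (unsigned)next_codepoint(s, i));
        std::printf("\n");
    }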

~~~
dbaupp
Does Go provide functions for iterating grapheme by grapheme?

~~~
cynwoody
That's being worked on†. It hasn't made it into the standard library as far as
I know.

E.g., there are two ways to write Cañyon City. You can write the ñ as U+00F1
or as an ascii lower-case n followed by a combining tilde (U+0303). The first
case results in a single rune, and the second in two runes. Example††. You
need additional logic in order to normalize to a canonical representation and
realize that the two strings are actually the same.

Also, if you are displaying the string, you need to account for the fact that,
although the two strings have different byte and rune lengths, they take up
exactly the same number of pixels on your display medium.

†[http://blog.golang.org/normalization](http://blog.golang.org/normalization)

††[http://play.golang.org/p/XJPydELZ6s](http://play.golang.org/p/XJPydELZ6s)
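
To make the two representations concrete, a small C++ sketch with the byte
sequences written out by hand (no normalization library; a real comparison
would go through something like ICU's unorm2_normalize):

    #include <cassert>
    #include <string>
    
    int main() {
        std::string nfc = "Ca\xC3\xB1yon";   // U+00F1: one rune, two bytes
        std::string nfd = "Can\xCC\x83yon";  // 'n' + U+0303: two runes
        assert(nfc.size() == 7 && nfd.size() == 8);
        assert(nfc != nfd);  // byte-wise they differ, so a plain string
                             // compare says "not the same city"
    }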

~~~
coldtea
> _E.g., there are two ways to write Cañyon City. You can write the ñ as
> U+00F1 or as an ascii lower-case n followed by a combining tilde (U+0303).
> The first case results in a single rune, and the second in two runes.
> Example††. You need additional logic in order to normalize to a canonical
> representation and realize that the two strings are actually the same._

Who thought that having two ways to go about this was a good idea in the first
place?

------
huhtenberg
Now, all the advice in the Windows section - don't do this, don't do that,
only and _always_ do the third - is lovely, but if you happen to care about
your app's performance, you _will_ have to carry wstrings around.

Take the simple example of an app that generates a bunch of logs that need to
be displayed to the user. If you follow the article's recommendations, you'd
have these logs generated and stored in UTF-8. Then, only when they are about
to be displayed on the screen, you'd convert them to UTF-16. Now, say you have
a custom control that renders log entries. Furthermore, let's imagine a user
who sits there and hits PgUp, PgDown, PgUp, PgDown repeatedly.

On every keypress the app will run a bunch of strings through
MultiByteToWideChar() to do the conversion (plus whatever other fluff comes
with any boost/STL wrappers), feed the result to DrawText() and then discard
the wstrings, triggering a bunch of heap operations along the way. And you'd
better hope the latter doesn't cause heap wobble across a defrag threshold.
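
To make the repeated cost concrete, here's a minimal sketch of the widen()
helper in question (implementation assumed; error handling omitted). Its
allocation and two-pass conversion run on every repaint under the UTF-8-only
scheme:

    #include <string>
    #include <windows.h>
    
    // UTF-8 -> UTF-16 at the Win32 boundary. One measuring call, one
    // converting call, plus a heap allocation for the result.
    std::wstring widen(const std::string& utf8) {
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                      (int)utf8.size(), nullptr, 0);
        std::wstring wide(len, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, utf8.data(), (int)utf8.size(),
                            &wide[0], len);
        return wide;
    }

Caching the converted wstring alongside each log entry is exactly the kind of
"carry wstrings around" compromise being argued for here.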

Is your code as sublime as it gets? Check. Does it look like it's written by
over-enlightened purists? You bet. Just look at this "advice" from the page -

    
    
      ::SetWindowTextW(widen("string literal").c_str())
    

This marvel passes a constant string to widen() to get another constant string
to pass to an API call. Just because the _code_ is more kosher without that
goddamn awful L prefix. Extra CPU cycles? Bah. A couple of KB added to the
.exe due to inlining? Who cares. But would you just look at how zen the code
is.

    
    
      --
    

tl;dr - keeping as much text as possible in UTF8 in a Windows app is a good
idea, but just make sure not to take it to the extremes.

~~~
guard-of-terra
"if you happen to care about app's performance, you will have to carry
wstrings around"

If those strings are for the user to read, he's reading a million times slower
than even the most ornate re-encoding runs. Sounds like premature
optimization.

~~~
eps
It's not a premature optimization. It's a manifestation of a different set of
coding ethics which is just ... err ... less wasteful and generally more
thoughtful.

~~~
TeMPOraL
Yup. I wish this ethic were more popular. I can understand that we "waste"
countless cycles in order to support abstraction layers that help us code
faster and with fewer bugs. But I think that our programs could still be an
order of magnitude faster (and/or burn less coal) if people thought a little
bit more and coded a little bit slower. The disregard people have for writing
fast code is terrifying.

Or maybe it's just me who is weird. I grew up on gamedev, so I feel bad when
writing something obviously slow, that could be sped up if one spent 15
minutes more of thinking/coding on it.

~~~
brazzy
Yeah, I'll have to disagree with both of you. The "coding ethics" that wants
to optimize for speed everywhere is the wasteful and thoughtless one.

Computers are fast, you don't have to coddle them. _Never_ do any kind of
optimization that reduces readability without concrete proof that it will
actually make a difference.

15 minutes spent optimizing code that takes up 0.1% of a program's time are 15
wasted minutes that probably made your program worse.

Additionally: "Even good programmers are very good at constructing performance
arguments that end up being wrong, so the best programmers prefer profilers
and test cases to speculation."(Martin Fowler)

~~~
huhtenberg
> _Computers are fast, you don't have to coddle them_

This mentality is exactly why Windows feels sluggish in comparison to Linux on
the same hardware. Being careless with the code and unceremoniously relying on
spare (and frequently assumed) hardware capacity is certainly a way to do
things. I'm sure it makes a lot of business sense, but is it good
_engineering_? It's not.

~~~
brazzy
Neither is optimization for its own sake, it's just a different (and worse)
form of carelessness and bad engineering.

Making code efficient _is not a virtue in its own right_. If you want
performance, set measurable goals and optimize the parts of the code that
actually help you achieve those goals. Compulsively optimizing everything will
just waste a lot of time, lead to unmaintainable code and quite often _not_
actually yield good performance, because bottlenecks can (and often do) hide
in places where bytes-and-cycles OCD overlooks them.

~~~
huhtenberg
I think we are talking about different optimizations here. I'm referring to
the "think and use qsort over bubble sort" kind of thing, while you seem to be
referring to hand-tuned inline assembly optimizations.

My point is that the "hardware can handle it" mantra is a tell-tale sign of a
developer who is more concerned with his own comfort than anything else. It's
someone who's content with not pushing _himself_, and that's just wrong.

    
    
      --
    

(edit) While I'm here, do you know how to get the uptime on Linux?

    
    
      cat /proc/uptime
    

Do you know how to get uptime on Windows? WMI. It's just absolutely f#cking
insane that I need to initialize COM, instantiate an object, grant it the
required privileges, and set up proxy impersonation, only to be allowed to
send an RPC request to a system service (that may or may not be running, in
which case it will take 3-5 seconds to start) that will, on my behalf, talk to
something else in Windows' guts and then reply with a COM variant containing
the answer. So that's several megs of memory, 3-4 non-trivial external
dependencies and a second of run-time to get the uptime.

Can you guess why I bring this up?

Because that's exactly a kind of mess that spawns from "oh, it's not a big
overhead" assumption. Little by little crap accumulates, solidifies and you
end up with this massive pile of shitty negligent code that is impossible to
improve or refactor. All because of that one little assumption.

~~~
danellis
You make WMI sound long-winded, but do you think 'cat /proc/uptime' is free?
There's a lot involved in opening a file.

~~~
guard-of-terra
On the process side, it's a few system calls, and the operating system always
has that code at hand; it does not need to load anything (that's what's slow).

------
Pxtl
I was horrified to discover that Microsoft SQL Server's text import/export
tools don't even support UTF-8. Like, at all. You can either use their
bastardized wrong-endian pseudo-UTF-16, or just pick a code page and go pure
8-bit.

~~~
josteink
I'm not sure which modules or tools you are talking about, but if you use SQL
Server Integration Services (formerly SQL Server Data Transformation
Services), you basically have a data processing pipeline which supports
everything and all transformations on the planet.

And obviously it supports arbitrary text-encodings, although sometimes you
will need to be explicit about it.

If you used the simplified wizards, all the options may not have been there,
but you should have been given the option to export/save the job as a package,
and then you can open, modify, test and debug that before running the job for
real.

Seriously. SQL Server has some immensely kick-ass and über-capable tooling
compared to pretty much every other database out there.

To even suggest it doesn't support UTF8 is ludicrous.

~~~
Someone
If SQL Server supports UTF-8, Microsoft manages to hide that fact well.
[http://technet.microsoft.com/en-us/library/ms176089.aspx](http://technet.microsoft.com/en-us/library/ms176089.aspx):

 _char [ ( n ) ] Fixed-length, non-Unicode string data._

[http://technet.microsoft.com/en-us/library/ms186939.aspx](http://technet.microsoft.com/en-us/library/ms186939.aspx)

 _Character data types that are either fixed-length, nchar, or variable-
length, nvarchar, Unicode data and use the UNICODE UCS-2 character set._

So, (var)char is "non-Unicode", and n(var)char is UCS-2 only.

That is in agreement with
[http://blogs.msdn.com/b/qingsongyao/archive/2009/04/10/sql-s...](http://blogs.msdn.com/b/qingsongyao/archive/2009/04/10/sql-server-and-utf-8-encoding-1-true-or-false.aspx),
which claims the glass is half full (_"In summary, SQL Server DOES support
storing all Unicode characters; although it has its own limitation."_)

On the other hand, we have
[http://msdn.microsoft.com/en-us/library/ms143726.aspx](http://msdn.microsoft.com/en-us/library/ms143726.aspx),
which seems to state that SQL Server 2012 has proper Unicode collations. UTF-8
is still nowhere to be found, though.

~~~
josteink
To be fair, the format in which data is stored in the DB and the format used
for importing data are two entirely different things.

If you want to treat data as a stream of bytes, hardcore UTF-8 and PHPesque
style (this function is "binary safe", woo), with no regard to the actual
_text_ involved, feel free to store it as bytes. SQL Server supports that.

If you want to store it as _unicode text_ feel free to use the ntext and
nvarchar types. I'm pretty sure that's what you intend to do anyway, even
though you insist on calling it UTF8.

~~~
Someone
I'm not the original complainer about UTF-8 support, but _"If you want to
store it as unicode text feel free to use the ntext and nvarchar types"_ comes
at a price: for the oh-so-common almost-ASCII text collections, it blows up
your disk usage and I/O bandwidth for actual data by a factor of almost 2. For
shortish fields the difference probably isn't that significant, but if you
store, say, web pages or blog posts, it can add up.

------
vorg
I don't like the way UTF-8 was clipped to only 1 million codepoints in 2003 to
match the UTF-16 limit. The 2.1 billion codepoint capacity of the original
1993 UTF-8 proposal would've been far better. Go uses \Uffffffff as syntax to
represent runes, giving the same upper limit as the original UTF-8 proposal,
so I wonder if it supports, or one day will support, the extended 5- and
6-byte sequences.

In fact, UTF-16 doesn't really have the 1 million character limit: by using
the two private-use planes (F and 10) as 2nd-tier surrogates, we can encode
all 4-byte values of UCS-4, including all those in the original UTF-8
proposal.

I suspect the reason is more political than technical. unicode.org
([http://www.unicode.org/faq/utf_bom.html#utf16-6](http://www.unicode.org/faq/utf_bom.html#utf16-6))
says: _"Both Unicode and ISO 10646 have policies in place that formally limit
future code assignment to the integer range that can be expressed with current
UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other UTFs) can
represent larger integers, these policies mean that all encoding forms will
always represent the same set of characters. Over a million possible codes is
far more than enough for the goal of Unicode of encoding characters, not
glyphs. Unicode is not designed to encode arbitrary data."_

~~~
kzrdude
What use would it have to have so much extra codepoint space?

~~~
vorg
2 planes (130,000) of private-use codepoints aren't enough, and because the
top 2 planes of Unicode are designated private use, UTF-16 gives developers
the option of extending them to 2.1 billion if they need it. I've wanted extra
private-use space for generating Unihan characters by formula in the same way
the 10,000 Korean Hangul ones are generated from 24 Jamo. I'm sure many other
developers come across other scenarios where 130,000 isn't enough for private
use.

I'm simply saying that UTF-8 shouldn't be crippled in the Unicode/ISO spec to
21 bits, but be extended to 31 bits as originally designed because the
technical reason given (i.e. because UTF-16 is only 21 bits) isn't actually
true. The extra space should be assigned as more private use characters.
(Except of course the last two codepoints in each extra plane would be
nonchars as at present, and probably also the entire last 2 planes if the 2nd-
tier "high surrogates" finish at the end of a plane.)

------
jasonjei
We constantly have to deal with Win32 as a build platform, and we write our
apps natively for that platform using wchar. I think the main difficulty is
that most developers hate adding another library to their stack, and to make
matters worse, displaying this text in a Windows GUI requires conversion to
wchar. That's why I think they are in for a lot of resistance, at least in the
Windows world. If the Windows APIs were friendlier to UTF-8, there might be
hope. But as it stands right now, using UTF-8 requires the CA2W/CW2A macros,
which is just a lot of dancing to keep your strings in UTF-8 when they
ultimately must be rendered as wchar/UTF-16.

There might be a shot at getting developers to switch if the Windows
GUI/native APIs would render Unicode text presented as UTF-8. But right now,
it's back to encoding/decoding.

------
randomfool
"This is what made UTF-8 the favorite choice in the Web world, where English
HTML/XML tags are intermixed with any-language text."

Except that Javascript is UTF-16, so no luck with 4 byte chars there.

~~~
lisper
> Javascript is UTF-16

No it isn't. Javascript is no different from any other text. It can be encoded
in any encoding. Where did you get the idea that JS is UTF-16?

EDIT: I misunderstood the intent of the comment I was responding to. JS uses
(unbeknownst to me) UTF-16 as its internal representation of strings.

~~~
thurn
GP means string literals. To quote from the spec: "4.3.16 String value:
primitive value that is a finite ordered sequence of zero or more 16-bit
unsigned integer... Each integer value in the sequence usually represents a
single 16-bit unit of UTF-16 text."

~~~
jgraham
The "usually" there turns out to be important.

Javascript "strings" are, as the spec says, just arrays of 16 bit integers
internally. Since Unicode introduced characters outside the Basic Multilingual
Plane (BMP) i.e. those with codepoints greater than 0xFFFF it has no longer
been possible to store all characters as a single 16 bit integer. But it turns
out that you _can_ store non-BMP character using a pair of 16 bit integers. In
a UTF-16 implementation it would be impossible to store one half of a
surrogate pair without the other, indexing characters would no longer be O(1)
and the length of a string would not necessarily be equal to the number of 16
bit integers, since it would have to account for the possibility of a four
byte sequence representing a single character. In javascript none of these
things are true.

This turns out to be quite a significant difference. For example it is
impossible in general to represent a javascript "string" using a conforming
UTF-8 implementation, since that will choke on lone surrogates. If you are
building an application that is supposed to interact with javascript — for
example a web browser — this prevents you from using UTF-8 internally for the
encoding, at least for those parts that are accessible from javascript.
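
To illustrate the lone-surrogate problem, a minimal C++ encoder sketch
(hypothetical helper, not any browser's actual code); a conforming encoder
simply has no output for values in 0xD800-0xDFFF:

    #include <optional>
    #include <string>
    
    // Encode one code point as UTF-8, refusing what a conforming
    // encoder must refuse: surrogates and anything past U+10FFFF.
    std::optional<std::string> encode_utf8(char32_t cp) {
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
            return std::nullopt;
        std::string out;
        if (cp < 0x80) {
            out += (char)cp;
        } else if (cp < 0x800) {
            out += (char)(0xC0 | (cp >> 6));
            out += (char)(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            out += (char)(0xE0 | (cp >> 12));
            out += (char)(0x80 | ((cp >> 6) & 0x3F));
            out += (char)(0x80 | (cp & 0x3F));
        } else {
            out += (char)(0xF0 | (cp >> 18));
            out += (char)(0x80 | ((cp >> 12) & 0x3F));
            out += (char)(0x80 | ((cp >> 6) & 0x3F));
            out += (char)(0x80 | (cp & 0x3F));
        }
        return out;
    }
    
    // A JS engine can happily hold the 16-bit unit 0xD800 in a string,
    // but encode_utf8(0xD800) has no valid answer: hence the mismatch.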

------
belluchan
And software developers, don't forget to implement the 4-byte characters too,
please. It's an utter nightmare dealing with MySQL. I believe 4-byte
characters still break GitHub comments.

~~~
90minuteAPI
About 2 weeks ago I tried to file a bug with our backend guys about 4-byte
characters wreaking havoc on our API.

My example broke the bug tracker's (bugzilla) comment system as well. I
chuckled.

~~~
MBCook
Yeah. We've noticed, to our own amusement, that Jira (we're on an older
version) can't handle non-ASCII. Makes entering tickets involving other
languages fun.

------
elwell
I can only imagine what kind of frustration drove someone to make this site.

~~~
SomeCallMeTim
The same frustration made me write a custom string library for the Playground
SDK, years ago.

std::string is missing a lot of functionality one tends to need when dealing
with strings (such as iterating over UTF-8 characters or fast conversion to
UTF-16, but also things like search-and-replace). And it makes me sad that I
can't use that string library any more (legally) because of the license
PlayFirst insisted on using (no redistribution).

As far as I'm concerned, though, there IS no good string library available for
use anywhere. I've looked at all of the ones I could find, and they're all
broken in some fundamental way. I guess solving the "string problem" isn't
sexy enough for someone to release a library that actually hits all the pain
points.

~~~
klmr
You might like ogonek[1]. It still doesn't implement regular expressions and
is thus strongly limited in its capabilities (and C++11 regexes are so badly
designed that they cannot be extended meaningfully to handle this case), but
it has hands down the best API for working with text in C++. It makes using a
wrong encoding a compile-time error and offers effortless ways of dealing with
actual Unicode entities (code points, grapheme clusters) rather than bytes.

[1] [https://github.com/rmartinho/ogonek](https://github.com/rmartinho/ogonek)

~~~
Danieru
You've sparked my curiosity: what's wrong with C++11's regex?

I've used them over the summer but nothing felt broken beyond the general C++
verbosity. Granted most of my prior regex work was in Perl.

------
nabla9
UTF-8 is usually good enough on disk.

I would like to have at least two options in memory: UTF-8 and a vector of
displayed characters (there are many combinations in use in existing modern
languages with no single-character representation in UTF-<anything>).

~~~
Guvante
Do you need a vector of displayed characters?

Usually all you care about is the rendered size, which your rendering engine
should be able to tell you. No need to be able to pick out those characters in
most situations.

~~~
nabla9
Yes. If I want to work with language and do some stringology, that's what I
want. I might want to swap some characters, find the lengths of words, etc. To
have a vector of characters (as what humans consider characters) is valuable.

~~~
pornel
> To have vector of characters (as what humans consider characters) is
> valuable.

That might be an awful can of worms. Are Arabic vowels characters? "ij" letter
in Dutch? Would you separate Korean text into letters or treat each block of
letters as a character?

~~~
sanxiyn
I can answer the question on Korean. Treat each block of letters as a
character. Never ever separate for human uses.

------
optimiz3
Most of the post talks about how Windows made a poor design decision in
choosing 16-bit characters.

No debate there.

However, advocating "just make Windows use UTF-8" ignores the monumental
engineering challenge and the legacy backwards-compatibility issues.

In Windows most APIs have FunctionA and FunctionW versions, with FunctionA
meaning legacy ASCII/ANSI and FunctionW meaning Unicode. You couldn't really
fix this without adding a third version that was truly UTF-8; changing the
existing ones would break lots of apps in subtle ways.

Likely it would also only be available to Windows 9-compatible apps if such a
feature shipped.

No dev wanting to make money is going to ship software that only targets
Windows 9, so the entire ask is tough to sell.

Still no debate on the theoretical merits of UTF-8 though.

~~~
angersock
Nothing worth doing is easy.

Anyway, the FunctionA/FunctionW split is usually hidden behind a macro (for
better or worse). This could simply be yet another compiler option.

------
BadassFractal
Would be lovely if MS Office could export CSV to UTF-8, but nope.

~~~
frik
Yes, I had issues with that in Excel 2010 too.

The only well-known Microsoft application that can handle UTF-8 is
notepad.exe (Win 7).

------
wehadfun
I admire and appreciate your concern for something that is misunderstood and
ignored. However, this webpage took way too long to say what is so great about
UTF-8.

~~~
jasonjei
Honestly, I think it is platform politics. *nix systems seem to prefer UTF-8,
while UTF-16 is the default on Windows. Space and memory are cheap, so either
encoding seems fine.

The bottom line is that UTF-8 is awkward to use on Windows, while
UTF-16/wchar_t is awkward to use on Linux, simply because the core APIs make
them so (there is no _wfopen function in glibc).

~~~
MichaelGG
It's not really politics. Microsoft made the choice of fixed-size chars back
when it was thought that 16 bits would be enough for everyone. MS was at the
forefront of internationalizing things, and probably still is. (Multilanguage
support in Windows and Office is quite top class.)

Unfortunately, we need more than 16 bits of codepoints, so 16-bit chars are a
waste and, in hindsight, a bad decision. It seems unlikely that a fresh
platform with no legacy requirements would choose a 16-bit encoding. Think of
all the XML in Java and .NET - nearly always pure ASCII, using up double the
RAM for zero benefit. It sucks.

Was UTF-8 even around when Microsoft decided on 16-bit widechar?

Other platforms seem to have lucked out by not worrying as much about
standardizing on a single charset, and UTF-8 came in and solved the problem.

~~~
Someone
_" Was UTF-8 even around when Microsoft decided on 16-bit widechar?"_

No, Thompson's placemat is from September 1992 and NT 3.1 from July 1993, but
development on NT started in November 1989
([http://en.wikipedia.org/wiki/Windows_NT#Development](http://en.wikipedia.org/wiki/Windows_NT#Development))

------
mattfenwick
I think it's unfortunate that it doesn't have more concrete examples. I think
having more of those would really help strengthen their case, clarify their
points, and make their arguments tangible and understandable to a much wider
audience.

One instance where I really wish for examples: they mention characters, code
points, code units, grapheme clusters, user-perceived characters, fonts,
encoding schemes, multi-byte patterns, BE vs LE, BOM, .... While I kind of get
some of these, I certainly don't understand all of them in detail, and so
there's no way that I'll grasp the subtleties of their complicated
interactions. Examples, even of simple things such as what actually gets saved
to disk when I write out a string using UTF-8 encoding vs. UTF-16, especially
when using higher codepoints, would be hugely beneficial for me.
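
For what it's worth, here is roughly the example being asked for, as a small
C++ sketch: the bytes that end up on disk for one BMP and one non-BMP code
point in each encoding (UTF-16 shown little-endian, no BOM):

    #include <cstdio>
    
    static void dump(const char* label, const unsigned char* p, int n) {
        std::printf("%-11s", label);
        for (int i = 0; i < n; ++i) std::printf(" %02X", p[i]);
        std::printf("\n");
    }
    
    int main() {
        // U+00E9 'é': two UTF-8 bytes, one 16-bit UTF-16 unit
        unsigned char e8[]  = { 0xC3, 0xA9 };
        unsigned char e16[] = { 0xE9, 0x00 };
        // U+1F600 (an emoji): four UTF-8 bytes, and in UTF-16 the
        // surrogate pair D83D DE00
        unsigned char g8[]  = { 0xF0, 0x9F, 0x98, 0x80 };
        unsigned char g16[] = { 0x3D, 0xD8, 0x00, 0xDE };
        dump("U+00E9/8",   e8, 2);
        dump("U+00E9/16",  e16, 2);
        dump("U+1F600/8",  g8, 4);
        dump("U+1F600/16", g16, 4);
    }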

------
chj
Windows is a horrible environment for UTF-8 unless MS provides a special
locale for it.

In its present state, you can choose to use UTF-8 internally in your app, but
when you need to cooperate with other programs (over sockets or files), it's
going to be confusing. Some will be sending you ANSI bytes and you'll be
taking them as UTF-8.

------
andystanton
"Even though one can argue that source codes of programs, web pages and XML
files, OS file names and other computer-to-computer text interfaces should
never have existed, as long as they do exist, text is not only for human
readers."

I'm a little confused by this statement. Can someone clarify?

~~~
claudius
I _think_ the author wants to say that a) computers _should_ use appropriate
binary formats for communication between them (e.g. from a web server to a
browser), b) they don’t and use plain text in many cases (e.g. HTML) and hence
c) text is not only read by humans but also by computers.

I am not entirely sure whether that makes any sense, though.

------
GnarfGnarf
Interesting. I came to the same conclusion myself a few years ago when
converting a Windows app to Unicode. I store all strings as UTF-8, which
enabled me to continue using strncpy, char[], etc. I convert to wchar_t only
when I need to pass the string to Win32. I can even switch from narrow to wide
chars dynamically: I use a global switch which tells me whether I am running
in Unicode or not, and call the 'A' or 'W' version of the Win32 function,
after converting to wchar_t if necessary.
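
A sketch of that dispatch pattern (hypothetical names; widen() converts UTF-8
to UTF-16 via MultiByteToWideChar, and to_ansi() would be its ANSI counterpart
via WideCharToMultiByte):

    #include <string>
    #include <windows.h>
    
    std::wstring widen(const std::string& utf8);    // UTF-8 -> UTF-16
    std::string  to_ansi(const std::string& utf8);  // UTF-8 -> ANSI codepage
    
    extern bool g_unicode;  // the global narrow/wide switch described above
    
    void set_title(HWND hwnd, const std::string& utf8) {
        if (g_unicode)
            SetWindowTextW(hwnd, widen(utf8).c_str());
        else
            SetWindowTextA(hwnd, to_ansi(utf8).c_str());
    }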

------
puppetmaster3
In Javascript, it's UTF-16. Also Java.

Can't speak for the others off the top of my head.

------
angersock
What is currently the best way of dealing with UTF-8 strings in a cross-
platform manner? It sounds like widechars and std::string just won't cut it.

~~~
90minuteAPI
Yeah, you really need some specialized interface that encapsulates all the
things you may need from your string processing, but can spit out a
representation of that string in various encodings.

In line with the spirit of this article, that interface should use UTF-8
storage internally as well, but this should be transparent to the programmer
anyway. Dealing with encoded strings directly is a recipe for heartache unless
you're actually writing such a library.
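
A minimal sketch of what such a wrapper could look like (hypothetical Text
class; UTF-8 inside, conversions only at the edges; std::wstring_convert is
deprecated in C++17 but keeps the example short):

    #include <codecvt>
    #include <locale>
    #include <string>
    
    // Stores UTF-8 internally; other encodings exist only at the boundary.
    class Text {
        std::string utf8_;
    public:
        explicit Text(std::string utf8) : utf8_(std::move(utf8)) {}
        const std::string& utf8() const { return utf8_; }
        // Produce UTF-16 only when an API (e.g. Win32) demands it.
        std::u16string utf16() const {
            std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,
                                 char16_t> conv;
            return conv.from_bytes(utf8_);
        }
        // Code point / grapheme iteration would live here too, so callers
        // never index raw bytes.
    };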

------
jahewson
> there is a silent agreement that UTF-8 is the most correct encoding for
> Unicode on the planet Earth

But what about other planets? Is there a Unicode Astral Plane which may encode
poorly in the future?

~~~
taejo
There are 13 Unicode astral planes (though only two of them have characters
assigned so far), and they do indeed encode poorly in some environments. The
planes other than plane 0, the Basic Multilingual Plane, are informally known
as "the astral planes".

~~~
vorg
There are 16 astral planes (U+1xxxx to U+10xxxx), of which 3 have characters
assigned...

* plane 1 is the supplementary multilingual plane

* plane 2 is the supplementary ideographic plane

* plane E is the supplementary special-purpose plane

* planes F and 10 are private-use planes

Perhaps you wrote from memory.

~~~
taejo
Thanks for providing the correct details. I was indeed writing from memory.

------
duaneb
Thank god for emoji.

~~~
pjscott
They're very useful if you want a test case that requires multiple bytes in
UTF-8 and multiple words in UTF-16.

~~~
duaneb
Seriously, I wasn't being sarcastic! Emoji have been the single largest
driving factor in proper Unicode adoption I've seen—they're the first non-BMP
characters to see widespread use.

------
Dewie
IT is so Anglophile that programs can become slower if you deviate from
ASCII...

But of course being so incredibly anglocentric is not an issue; at least that
seems to be the consensus of the participants when I read discussions on the
Web where _all the people discussing it write English with such proficiency
that I can't tell who is and isn't a native speaker of the language_.

~~~
jasonjei
I'm Chinese American, and I don't agree with your statement that string
libraries are Anglophile. How would you encapsulate the 10,000 commonly used
Chinese characters? It's just the reality of having a lot of characters in a
language. There's not much else you can do to speed up processing. How would
you design string storage to be faster for a language like Chinese?

English happens to be the lingua franca of engineering. It's not about
brown-nosing English-speaking countries, but about reaching the widest
possible audience.

~~~
adsr
Let me ask you: with 10k commonly used characters, doesn't that lead to
shorter texts? Kind of like how higher-base numbers can encode larger numbers
with fewer digits; in that case the longer encoding of UTF-8 could be made up
for by using fewer characters. Or am I wrong about this assumption?

As an example, suppose there is one character that denotes the word 'house';
if that single character is encoded using five bytes, it takes the same amount
of space as the English encoding.

~~~
Crito
That seems more than plausible to me. While the character 象 is two bytes
longer than the character "f", it is _five_ bytes shorter than "elephant".

IIRC the average word length in English is around 5 characters.
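
The arithmetic checks out; a quick C++ check (象 is U+8C61, three bytes in
UTF-8):

    #include <cstdio>
    #include <cstring>
    
    int main() {
        // prints "3 8": the ideograph costs 3 bytes, "elephant" costs 8
        std::printf("%zu %zu\n", std::strlen("\xE8\xB1\xA1"),
                    std::strlen("elephant"));
    }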

------
josteink
Looking at the .NET parts of the manifesto, I just have to roll my eyes:

 _Both C# and Java offer a 16 bit char type, which is less than a Unicode
character, congratulations. The .NET indexer str[i] works in units of the
internal representation, hence a leaky abstraction once again. Substring
methods will happily return an invalid string, cutting a non-BMP character in
parts._

While theoretically true, for most practical purposes this reeks of a
USA/American/English bias and a lack of real-world experience.

You know what? I want to know that the _text_ "ØÆÅ" is three characters long.
I don't want to know that it's a 6-byte array once encoded to UTF-8 (see the
sketch at the end of this comment). Anything in my code telling me this is 6
characters is _lying_, not to mention a violation of numerous business
requirements.

When I work with text I want to work with text and never the byte-stream it
will eventually be encoded to. I want to work on top of an abstraction which
lets me treat text as text.

Yes, there are cases where the abstraction will leak. But those cases are few
and far between. And in all the cases where it doesn't, it offers me numerous
advantages over the PHPesque, amateurish and _incorrect_ approach of treating
everything as a dumb byte array.

It's not. It's text in my program. It's text rendered on your screen. It's
just a byte-array when we send it over the wire, so stop trying to pretend
text isn't text.

This manifesto is wildly misguided.
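
For the record, here's what the "ØÆÅ" case looks like if you count code points
rather than bytes (a minimal sketch; code points are still not grapheme
clusters, but for this string they match what a human would count):

    #include <cstdio>
    #include <string>
    
    // Count UTF-8 code points by skipping continuation bytes (10xxxxxx).
    static size_t codepoints(const std::string& s) {
        size_t n = 0;
        for (unsigned char b : s)
            if ((b & 0xC0) != 0x80) ++n;
        return n;
    }
    
    int main() {
        std::string s = "\xC3\x98\xC3\x86\xC3\x85";  // "ØÆÅ" in UTF-8
        std::printf("%zu bytes, %zu code points\n", s.size(), codepoints(s));
        // prints: 6 bytes, 3 code points
    }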

~~~
derefr
I fundamentally agree, but I don't think even you understand the implications
of what you're saying. Though you say "characters", you're probably still
thinking about Unicode _codepoints_ here, not characters. A real _character_
-oriented API would require an arbitrary number of bytes to store each
character (a character being an arbitrary vector of codepoints), and would
require a _font_ to be associated with each region of codepoints--because it's
the font's decision how to render a sequence of codepoints, so it's the font
that determines how many _characters_ you get. (For an edge-case example:
[http://symbolset.com/](http://symbolset.com/))

...and saying that, it'd _still_ be a good idea.

~~~
chipsy
When I learned of the difficulty of mapping between code points and
characters, I realized that Unicode is a standard nobody will ever (knowingly)
implement correctly.
correctly. Even if everyone has access to a font API, there'll probably be
bugs in the fonts for all eternity.

~~~
derefr
(I half-considered adding this to my comment above, but it didn't quite fit.)

If we were serious about a character-oriented API, we definitely wouldn't want
to introduce _character rendering rules_ into places like the kernel. But I
don't think we'd necessarily have to.

The best solution, I think, would be to decompose fonts into two pieces:

1\. a _character map_ (a mapping from parameterized codepoint sequences to
single entities known to the font),

and 2. a _graphemes file_ (the way to actually draw each character.)

The graphemes file would be what people would continue to think of as "the
font." And the graphemes file would _specify_ the character map it uses, in
much the same way an XML/SGML document specifies a DTD.

As with DTDs, the text library built into the OS would have a well-known core
set of character maps built in, and allow others to be retrieved and cached
when referenced by URL. The core set would become treated something like root
CAs or timezone data are now: bundled artifacts that get updated pretty
frequently by the package manager.

