
The UTF-8-Everywhere Manifesto - bearpool
http://www.utf8everywhere.org/
======
pilif
Really good article. You'll get nothing from me but heartfelt agreement. I
especially liked that the article gives numbers on how inefficient UTF-8
supposedly is for storing Asian text (not really, apparently).

Also insightful, but obvious in hindsight: not even in UTF-32 can you index a
specific character in constant time, due to combining character sequences.

The one property I really love about UTF-8 is that you get a free consistency
check, since not every arbitrary byte sequence is a valid UTF-8 string.

This is a big help in detecting encoding errors early (to this day,
applications are known to lie about the encoding of their output).
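
For the curious, that consistency check is short to write. A minimal sketch
(real decoders handle the same cases, just with more care about error
reporting):

    #include <cstddef>
    #include <cstdint>

    // Returns true if buf[0..len) is well-formed UTF-8. A minimal sketch:
    // checks lead/continuation structure, overlong forms, surrogates and
    // the U+10FFFF upper bound.
    bool is_valid_utf8(const unsigned char* buf, std::size_t len) {
        for (std::size_t i = 0; i < len; ) {
            unsigned char b = buf[i];
            std::size_t n;          // number of continuation bytes
            std::uint32_t cp, min;  // decoded value and its smallest legal value
            if (b < 0x80) { ++i; continue; }                              // ASCII
            else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; min = 0x80; }
            else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; min = 0x800; }
            else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; min = 0x10000; }
            else return false;               // stray continuation byte or 0xF8..0xFF
            if (i + n >= len) return false;  // sequence runs past the end
            for (std::size_t j = 1; j <= n; ++j) {
                if ((buf[i + j] & 0xC0) != 0x80) return false;   // not 10xxxxxx
                cp = (cp << 6) | (buf[i + j] & 0x3F);
            }
            if (cp < min || cp > 0x10FFFF) return false;         // overlong / out of range
            if (cp >= 0xD800 && cp <= 0xDFFF) return false;      // UTF-16 surrogate
            i += n + 1;
        }
        return true;
    }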

And of course, there's no endianness issue, removing the need for a BOM and
making it possible for tools that operate at the byte level to still do the
right thing.

If only it had better support outside of Unix.

For example, try opening a UTF-8 encoded CSV file (using characters outside of
ASCII, of course) in Mac Excel (in its latest versions; before that, it didn't
understand UTF-8 at all) for a WTF experience somewhere between comical and
painful.

If there is one thing I could criticize about UTF-8, it would be its
similarity to ASCII (which is also its greatest strength): it leads many
applications and APIs to boldly declare UTF-8 compatibility when all they can
really handle is ASCII, and they emit a mess (or blow up) once they have to
deal with code points outside that range.

I jokingly call this US-UTF8 when I encounter it (all too often,
unfortunately), but the proliferation of "cool" characters like the recently
added Emoji will likely help with this over time.

~~~
fleitz
"The one property I really love about UTF8 is that you get a free consistency
check as not every arbitrary byte sequence is a valid UTF8 string."

You don't get this at all using UTF-8. You only get it if you attempt to
decode the string, which even something like strlen doesn't do. Strlen will
happily give you wrong answers about how many characters are in a UTF-8 string
all day long and never ever attempt to check the validity of the string. Take
your valid UTF-8 and change one of the characters to null: now it doesn't work
in many circumstances with 'UTF-8' code.

Also, should the free consistency check ever actually work you're in a bigger
pickle as you now have to figure out whether the string is wrongly encoded
UTF-8 or someone sent you extended ASCII.

I did a lot of work with Unicode apps. I used to have a series of about 5
strings that I could paste into a 'UNICODE' application and invariably break
it.

One was an extended ASCII string that happened to be valid UTF-8, sans BOM :)

One was a UTF-8 string with a BOM and 0x00 inside :) (I call this string "how
to tell if it was written in C")

One was a UTF-8 string with a BOM :)

One was a UTF-8 string with some common Latin characters, a couple of Japanese
ones, and a character outside the BMP.

Two were UTF-16 strings in LE/BE, with and without a BOM.

~~~
jsprinkles
> You only get it if you attempt to decode the string which even something
> like strlen doesn't do.

Because strlen() is a count of chars in a null-terminated char[], not a
decoder. Ever. It's character set agnostic.

> Strlen will happily give you wrong answers about how many characters are in
> a UTF-8 string all day long and never ever attempt to check the validity of
> the string.

Because, again, strlen() counts chars in a null-terminated char[]. It is
giving you the _right_ answer, you are asking it the _wrong question_.
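
If you want the answer to the other question, counting code points instead of
bytes is a different loop. A rough sketch:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *s = "na\xC3\xAF" "ve";   /* "naïve" -- the 'ï' is two UTF-8 bytes */
        size_t code_points = 0;
        for (const char *p = s; *p; p++)
            if (((unsigned char)*p & 0xC0) != 0x80)   /* don't count continuation bytes */
                code_points++;
        /* strlen answers "how many bytes before the NUL": 6. The loop answers
           "how many code points": 5. Both are right answers to their own
           questions; neither validates anything. */
        printf("bytes: %zu, code points: %zu\n", strlen(s), code_points);
        return 0;
    }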

> Take your valid UTF-8 and change one of the characters to null, now it
> doesn't work in many circumstances with 'UTF-8' code.

Which means it's not a valid UTF-8 decoder, but is instead treating the buffer
as Modified UTF-8[1].

> that I could paste into a 'UNICODE' application

Clipboards or pasteboards in many operating systems butcher character sets
when copying and pasting text. Generally, the clipboard cannot be trusted to
do the right thing in every circumstance. On Windows in particular, the
character set can get converted to the system character set, or to something
rather arbitrary, when text is copied.

> One was a UTF-8 string with BOM and has 0x00 inside :) (I call this string
> how to tell if it was written with C)

> One was a UTF-8 string with a BOM :)

Don't use the BOM[2] in UTF-8. It's recommended against.

So really, your point is that some implementations are bad, and you have a bag
of tricks for breaking implementations that don't handle all the corner cases?
That's pretty universal even in the non-Unicode world; there are bad
implementations of everything. Windows is an especially bad implementation of
most things Unicode.

A valid decoder will, indeed, consistency-check an arbitrary string of bytes
as UTF-8. The OP is correct, and your corner cases don't refute his point.

[1]: <http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8>

[2]: <http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark>

~~~
fleitz
A name like strlen suggests that it's designed to take the length of a string;
if it were called count_null_ter_char_array then I'd tend to believe you. It's
not character set agnostic, it's monotheistic at the shrine of ASCII; that's
all over the coding style.

Null is valid UTF-8, it just doesn't work with C 'strings'. I can get null out
of a UTF-8 encoder with no problem.

My point is that UTF-8 is nowhere near the panacea being described, and that
if you have to touch the strings themselves it's far better to use UTF-16 in
the vast majority of cases. The only time you ever really want to use UTF-8 is
when you're dealing with legacy codebases; it's a massive hack.

~~~
jeltz
I do not understand how UTF-16 could be better for this reason. wcslen works
exactly like strlen but on wide chars instead of chars.
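
A quick illustration (assuming a 16-bit wchar_t as on Windows; on Linux
wchar_t is 32 bits and the answer changes):

    #include <cwchar>
    #include <cstdio>

    int main() {
        // U+1F600 lies outside the BMP, so with a 16-bit wchar_t it takes two
        // code units; wcslen counts both, exactly as strlen counts the four
        // UTF-8 bytes of the same character.
        const wchar_t *s = L"\U0001F600";
        std::printf("%zu\n", std::wcslen(s));   // 2 on Windows, 1 where wchar_t is 32-bit
        return 0;
    }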

------
gwillen
Ok, let me be the first approving top level comment: This document is correct.
The author of this document is smart. You should follow this document.

As jwz said about backups: "Shut up. I know things. You will listen to me. Do
it anyway."

------
luriel
Yes! I have been meaning to write something like this for years.

There is only one thing I would add: never add a BOM to a UTF-8 file! It is
redundant, useless, and breaks all kinds of things by attaching garbage to the
start of your files.

Edit: Here is the interesting story of how Ken Thompson invented UTF-8:
<http://doc.cat-v.org/bell_labs/utf-8_history>

~~~
makecheck
The mark isn't useless; it clearly identifies files as UTF-8 so they can be
processed as such immediately. Otherwise a program has to "sniff" several
bytes to see if the encoding could be something different, and it may not
guess correctly.

Also, how can "all kinds of things" break with this mark? If something is
reading UTF-8 _correctly_ then it'll be fine with the mark; and if it's not
reading UTF-8 correctly then it will screw up a lot more than the mark at the
beginning of the file.

~~~
unconed
This argument is silly. Why not prefix every UTF-8 string with a BOM then?
It's wasteful and unnecessary, because UTF-8's clean structure already makes
it trivial to detect, and false positives are all but impossible for
real-world text. There's a paper out there that proves this.

The UTF-8 BOM was a Microsoft invention. Nobody else uses it, and it breaks
tons of things. Two examples off the top of my head: Unix hashbang scripts
(i.e. #!/bin/bash), and PHP scripts (the BOM will trigger HTTP header
finalization before any code is run).

~~~
makecheck
You wouldn't prefix every string with it because presumably your API or
program's state has already determined the string's encoding. I am not
suggesting that every fragment of text has to be explicit (I agree that would
be ridiculous). I am only stating facts: there is nothing incorrect about
having the mark, a conformant reader must be able to handle the mark, and the
mark has _some_ value as a short-cut for avoiding elaborate decoding tricks.

~~~
unconed
The thing is, the BOM is metadata, it doesn't belong in content. It violates
the contract of .txt files, which is: the entire file is a single string of
content.

Recognizing it at the edges of your program and stripping it out is not the
end of the world, but it's annoying and no other (8-bit) encoding works that
way. In fact, I find it hard to believe UTF-8 BOMs in MS programs were
anything more than a programmer error. Once such files were out in the wild,
everyone else had to deal with them.

~~~
makecheck
There are already plenty of cases that valid UTF-8 readers have to deal with
(unused ranges of code points, invalid byte combinations, etc.). Ignoring a
BOM is trivial by comparison. A UTF-8 reader honestly _doesn't care_ about the
"stringness" of a .txt file because of all the other crap that can be in a
byte stream.
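
To put a number on "trivial": it's about three lines (a sketch):

    #include <cstring>
    #include <cstddef>

    // If the buffer starts with the UTF-8 encoding of U+FEFF (EF BB BF),
    // skip it; everything after it is plain UTF-8 either way.
    const char* skip_utf8_bom(const char* data, std::size_t len) {
        if (len >= 3 && std::memcmp(data, "\xEF\xBB\xBF", 3) == 0)
            return data + 3;
        return data;
    }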

Older programs do care, but as I've said elsewhere in the thread an ASCII file
can remain ASCII (no BOM). There's no reason to BOM-ify an old ASCII file if
it really is ASCII and only ASCII-expecting programs will ever use it.

Over time these old programs will either be upgraded or go away and it will
finally be safe to say that inputs must be UTF-8. At _that_ time, the BOM has
no reason to exist.

------
pcwalton
Sadly, the pervasiveness of JavaScript means that UTF-16 interoperability will
be needed at least as long as the Web is alive. JavaScript strings are
fundamentally UTF-16. This is why we've tentatively decided to go with UTF-16
in Servo (the experimental browser engine) -- converting to UTF-8 every time
text needed to go through the layout engine would kill us in benchmarks.

For new APIs in which legacy interoperability isn't needed, I completely
approve of this document.

~~~
lambda
Yeah, it's really sad the number of legacy APIs which have standardized on
UTF-16.

The Windows API calls UTF-16 "Unicode". Most Mac OS X APIs use UTF-16.
JavaScript and Java both use UTF-16. ICU uses UTF-16. So while UTF-8 is
technically superior in almost every way, it's going to be an uphill battle to
standardize on it.

I appreciate that new languages like Rust and Go made the choice of UTF-8 as
their native text encoding. But there's a lot of inertia for UTF-16, and I'm
not sure it'll be easy to ever get free of it.

~~~
brigade
OS X APIs generally use NSString/CFString, which hide the actual encoding of
the string; they can be any encoding at all internally.

~~~
lambda
While they could, in theory, hide the actual encoding, the APIs all refer to
UTF-16 code units as "characters"; so while they could use UTF-8 as the
internal encoding, you need to use and understand UTF-16 in order to interact
with them properly. When you ask for the "length" of a string, you are told
the number of UTF-16 code units. When you get a character at an index, you get
the UTF-16 code unit. That's what I mean when I say the APIs use UTF-16:
everything in the API that deals with individual "characters" is actually
referring to UTF-16 code units.

The same is true of JavaScript; while you could technically implement the
strings however you want, the APIs are all oriented around UTF-16 code units.
And the Windows API, as well, is all built around UTF-16 code units.

The problem with all of these APIs is that they conflate characters and code
units. They all assume that a character is a single, fixed-width integer of
some given size (16 bits in the case of UTF-16). It is better to distinguish
between indexing in code units (such as bytes in UTF-8 or 16-bit integers in
UTF-16) and indexing in code points, or glyphs, or whatever higher-level
concept you are talking about. Really, for anything higher than the code unit
level, you should be dealing with variable-length strings, and not try to
force that into fixed-length units.

With UTF-8, there's no temptation to treat a single code unit as an
independently meaningful entity, as that assumption breaks down as soon as you
get past the ASCII range; with UTF-16, it's easy to make that mistake, since
it holds true for everything in the Basic Multilingual Plane, which contains
most characters you're likely to encounter on a day-to-day basis.
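
A quick way to see the different units in action (a sketch, using one
character outside the BMP):

    #include <string>
    #include <iostream>

    int main() {
        // One character outside the BMP, U+1F600:
        std::u16string utf16 = u"\U0001F600";
        std::u32string utf32 = U"\U0001F600";
        std::string    utf8  = "\xF0\x9F\x98\x80";   // its four UTF-8 bytes

        std::cout << utf16.size() << '\n';  // 2 -- UTF-16 code units (a surrogate pair)
        std::cout << utf32.size() << '\n';  // 1 -- code points
        std::cout << utf8.size()  << '\n';  // 4 -- bytes
        // Each "length" is an answer about code units, not about characters.
        return 0;
    }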

------
cygx
Personally, I prefer UTF-8 as well. However, I think this whole debate about
choice of encoding gets blown out of proportion.

Consider the following diagram:

    
    
                                   [user-perceived characters] <-+
                                                ^                |
                                                |                |
                                                v                |
                      [characters] <-> [grapheme clusters]       |
                           ^                    ^                |
                           |                    |                |
                           v                    v                |
          [bytes] <-> [codepoints]           [glyphs] <----------+
    

Choice of encoding only affects the conversion from bytes to codepoints, which
is pretty straightforward: the subtleties lie elsewhere...

------
raverbashing
Disagree

"UTF-16 is the worst of both worlds—variable length and too wide"

Really, the author tries to convince the reader, but it's not that clear-cut.

One of the advantages of UTF-16 is knowing right away that it's UTF-16, as
opposed to having to decide whether it's UTF-8/ASCII/some other encoding.
Sure, for transmission it's a waste of space (still, with today's computer
capabilities text size is a non-issue, even using UTF-32).

"It's not fixed width" But for most text, it is. Sure, you can do UTF-32 and
it may not be a bad idea (today)

Yes, Windows has to deal with several complications and with backwards
compatibility, so it's a bag of hurt. Still, they went the right way
(internally, it's Unicode, period).

"in plain Windows edit control (until Vista), it takes two backspaces to
delete a character which takes 4 bytes in UTF-16"

If I'm not mistaken, this is by design. A 4-byte character is usually typed as
a combination of characters, so if you want to change the last part of the
combination you just type one backspace.

~~~
lambda
> One of the advantages of UTF-16 is knowing right away it's UTF-16 as opposed
> to deciding if it's UTF-8/ASCII/other encoding. Sure, for transmission it's
> a waste of space (still, text for today's computer capabilities is a non
> issue even if using UTF-32)

First of all, if you don't know the encoding, then you don't know the
encoding, and you will need to figure out if it's UTF-8, UTF-16, ISO-8859-1,
etc. If you happen to know that it's UTF-16, you still need to figure out if
it's UTF-16BE or LE.

> "It's not fixed width" But for most text, it is.

This is a dangerous way of thinking. One of the big problems with UTF-16 is
that for most text, it is fixed width; so many people make that assumption,
and you never notice the problem until someone tries to use an obscure script
or an emoji character. This means that bugs can easily be lurking under the
surface; while with UTF-8, anything besides straight ASCII will break if you
assume fixed width, making it much more obvious.

> Sure, you can do UTF-32 and it may not be a bad idea (today)

UTF-32 isn't really meaningfully fixed width either. Sure, each code point is
represented in a fixed number of bytes, but code points are not necessarily
the interesting unit you want to index by. A glyph could be composed of
several code points. Most of the time, you actually want to deal with text in
longer units such as words or tokens, which are going to be variable width
anyhow. The actual width of individual code points is only really of interest
to low-level text processing libraries, not most applications.
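
For instance (a sketch, writing "é" as a base letter plus a combining accent):

    #include <string>
    #include <iostream>

    int main() {
        // One user-perceived character, "é", spelled as 'e' + U+0301 COMBINING ACUTE ACCENT:
        std::u32string s = U"e\u0301";
        std::cout << s.size() << '\n';   // 2 -- two code points, even in "fixed-width" UTF-32
        // So indexing by code point still doesn't give one index per glyph.
        return 0;
    }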

~~~
raverbashing
"First of all, if you don't know the encoding, then you don't know the
encoding"

True. But as you said, you have to know whether it's BE or LE with UTF-16, and
there are ways to determine that automatically (or it's on the same platform,
so it doesn't matter). With "ASCII compatible" encodings, you can't.

I guess the main issue to me is that UTF-16 is not "ASCII compatible" so you
know it's a different beast altogether.

And don't worry, I'm not assuming UTF-16 is fixed width. One should use the
libraries and not try to solve this 'manually'.

As for UTF-32, think CPU registers and operations: working with bytes is
inefficient (even with the benefit of smaller size).

~~~
luriel
> True. But as you said, you have to know if it's BE or LE on UTF16.

Yes, with UTF-16 you need to know not just the encoding, but also the
endianness. That makes UTF-16 _worse_, not better.

UTF-16 is really the worst of all possible worlds: tons of wasted space, plus
all the complexities of a variable-length encoding, without a fixed
endianness.

------
makecheck
Markus Kuhn's web page has a lot of useful UTF-8 info and valuable links (e.g.
samples of UTF-8 corner cases that people often miss).

<http://www.cl.cam.ac.uk/~mgk25/unicode.html>

~~~
lubutu
This is a great resource; it was extremely useful when I was writing a UTF-8
library myself. I found the UTF-8 stress test file particularly useful to run
tests against:
<http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>

~~~
erichocean
Hmm, TextMate has problems with "5.2 Paired UTF-16 surrogates" in that stress
test file.

(Yes, I interpreted the file as UTF-8 in TextMate).

------
haberman
Totally agree re: UTF-8 vs other Unicode encodings.

But are there still hold-outs who don't like Unicode? Last I heard, some CJK
users were unhappy about Han unification:
<http://en.wikipedia.org/wiki/Han_unification>

~~~
lmm
The main problem is that sorting by Unicode code point puts things in a
ridiculous order in Japanese/Korean. I kind of wish UTF-8 had put the Latin
alphabet in a silly order, so that Western programmers would realise they need
to use a locale-aware sort when sorting strings for display.

~~~
jeffdavis
I spoke with several Japanese people who said that some valid characters are
not representable in Unicode.

That means that it's not just a technical problem (expensive sort routines or
inefficient encodings) -- it's a _semantic_ problem.

~~~
thristian
The way I've heard it explained, there are some historical alternate versions
of some characters (a Latin-alphabet equivalent might be the way we sometimes
draw "a" with an extra curl across the top, and sometimes without) that have
the exact same semantic meaning, and so they were 'unified' to a single code
point. Unfortunately, some people spell their names exclusively with one
variant or the other, and Han unification makes that impossible in Unicode.

------
evincarofautumn
For those who don’t know it, UTF8-CPP[1] is a good lightweight header-only
library for UTF conversions, mostly STL-compatible.

[1] <http://utfcpp.sourceforge.net/>

------
tommi
That collection of best practices can hardly be considered a "UTF-8 Everywhere
Manifesto", as it focuses on Windows and C++. It's good, but for a domain like
that I'd rather see a more manifesto-like document that covers all cases.

~~~
archangel_one
I suspect this is mainly because Windows C++ programmers are the largest group
that they feel needs convincing. Which isn't totally their fault; Microsoft
haven't done well by them, not offering good support for UTF-8: you can
convert to/from it using WideCharToMultiByte, but that's pretty low-level, and
higher-level APIs like CString will cheerfully munge UTF-8 strings for you.
They also tend to conflate Unicode and UTF-16, which again doesn't help less
experienced programmers realise that there might be alternatives.

I've been through the Windows Unicode stuff at a previous job, which ended up
using mostly UTF-16 with some UTF-8 for interfacing to third party libraries
and for files which needed to be backward compatible to ASCII (plus
significant space savings, which I fought hard for). I think I prefer that
approach though, since after the (difficult) conversion you didn't need to
worry about encodings in 99% of the code. By their rules you'd gain
significant complexity by transforming all over the place in any non-trivial
GUI code.

------
antidoh
Text is maddening, the modern Tower of Babel.

Is there a definitive reference, or small handful of references, to learn all
that's worth knowing about text, from ASCII to UTF-∞ and beyond?

~~~
njs12345
Joel Spolsky's 'The Absolute Minimum Every Software Developer Absolutely,
Positively Must Know About Unicode and Character Sets (No Excuses!)' is a good
start: <http://www.joelonsoftware.com/articles/Unicode.html>

Like a few other specialised fields (cryptography comes to mind) the key
takeaway is to use a library and rely on the work of people who know it better
than you do and have handled all the subtleties already :)

------
mkup
I use UTF-8 for transmitted data and disk I/O, and I use UCS-4 (wchar_t on
Linux/FreeBSD) for the internal representation of strings in my software.

I generally agree with this article, but I disagree on the point that UTF-8 is
the only appropriate encoding for strings stored in memory, and I also
disagree on the point that wchar_t should be removed from the C++ standard or
made sizeof 1, as in the Android NDK.

Let me explain why.

In UTF-8, a single Unicode character may be encoded in multiple ways. For
example, NUL (U+0000) can be encoded as 00 or as C0 80. The second encoding is
illegal because it's longer than necessary and forbidden by the standard, but
a naive parser may extract NUL out of it. If UTF-8 input is not properly
sanitized, or there is a bug in a charset converter, this may result in an
exploit like SQL injection or arbitrary filesystem access or something like
that: a malicious party can encode not only NUL, but ", /, \ etc. this way.
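
A sketch of how this goes wrong in practice (a deliberately naive decoder;
this is essentially the classic overlong-slash trick):

    #include <cstdint>
    #include <cstdio>

    // Naive decoder for a two-byte UTF-8 sequence; it forgets to reject
    // overlong forms (it never checks that the result is >= 0x80).
    std::uint32_t naive_decode2(unsigned char b0, unsigned char b1) {
        return ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
    }

    int main() {
        // C0 AF is an illegal, overlong encoding of '/' (U+002F). A filter that
        // scans input for the literal byte 0x2F sees nothing suspicious, but a
        // decoder like the one above later turns it into a real slash.
        std::printf("%02X\n", (unsigned)naive_decode2(0xC0, 0xAF));   // prints 2F
        return 0;
    }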

Also, a UTF-8 string can't be cut at an arbitrary position. Byte groups (UTF-8
runes) must be processed as a whole, so each must end up either on the left
side or on the right side of the cut.
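
The usual way to pick a cut point that respects the byte groups (a sketch):

    #include <cstddef>

    // Back up from the desired cut position until we're at a byte that can
    // start a sequence (anything that is not a 10xxxxxx continuation byte),
    // so the cut never lands in the middle of a multi-byte group.
    std::size_t safe_cut(const unsigned char* buf, std::size_t pos) {
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            --pos;
        return pos;
    }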

Reversing a UTF-8 string is tricky, especially when illegal byte sequences are
present in the input string and the corresponding code points (U+FFFD) must be
preserved in the output string.

I think UTF-8 for network-transmitted data and disk I/O is inevitable, but our
software should keep all in-memory strings in UCS-4 only, and take adequate
security precautions everywhere a conversion between UTF-8 and UCS-4 happens.

And sizeof(wchar_t)==4 in the GCC ABI is not a design defect; wchar_t exists
for a good reason. I admit that sizeof(wchar_t)==2 on Windows is utterly
broken.

~~~
ubershmekel
Concerning "cut at an arbitrary position" actually utf-8 is the only codec
that can deterministically continue a broken stream because bytes that start a
character are special.
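
Roughly (a sketch): skip any continuation bytes at the start of the surviving
fragment and resume decoding there.

    #include <cstddef>

    // After a lost or garbled chunk, move forward to the next byte that can
    // begin a character (ASCII or a lead byte); decoding continues from there
    // without having to re-read the stream from the beginning.
    std::size_t resync(const unsigned char* buf, std::size_t len, std::size_t pos) {
        while (pos < len && (buf[pos] & 0xC0) == 0x80)
            ++pos;
        return pos;
    }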

------
erichocean
The strangest thing about Unicode (any flavor) is that NULL, aka \0, aka "all
zeros" is a valid character.

If you claim to support Unicode, you have to support NULL characters;
otherwise, you support a subset.

I find most OS utilities that "accept" Unicode fail to accept the NULL
character.

FWIW, UTF-8 has a few invalid bytes (byte values that can never appear in a
valid UTF-8 string). Any one of them could be used as an "end of string"
terminator if so desired, for situations where the string length is not known
up front.

We could even standardize which one (hint hint). I suggest -1 (all 1s).

UPDATE: I meant "strange" as in "surprising", especially for those coming from
a C background, like me.

~~~
lubutu
No. NUL is backwards-compatible with ASCII, and is used everywhere. Choosing
some arbitrary invalid UTF-8 byte for use as a terminator would be a terrible
decision. If you want to handle NUL, simply use length-annotated slices
instead of C-style NUL-terminated strings. Anything else is completely wrong.

~~~
erichocean
Did you even read my comment? NULL is a valid UTF-8 character.

If specific languages and their standard libraries _choose_ to treat it as a
string terminator (C, I'm looking at you), well, then fine.

But it's still valid Unicode. If you claim to support Unicode strings, but
don't support the NULL character in those strings, you don't support Unicode
strings in their entirety.

Going further, there's nothing in the ASCII spec that requires NULL to only
appear at the end of a valid string. That's a C language convention, AFAIK
(maybe it started earlier...).

~~~
bobbydavid
I think what you are trying to say is:

"because UTF-8 has invalid character sequences, we could potentially use one
of them to represent end-of-string, which would allow us the flexibility of a
null-terminated string (not keeping track of the length) without the
restriction of no-nulls-allowed."

You're right! Great. But you are not revealing a "strange thing" about
Unicode. You are instead making a general comment about null-terminated
strings. So why use such inflammatory and misleading language like "If you
claim to support Unicode, you have to support NULL characters"?

Update: I don't object to your idea at all, it's a neat trick! It's just that
the way it's phrased, it sounds like Unicode's design _contributed_ to this
NULL-terminal problem, when in fact even NULL-terminated ASCII strings cannot
'handle' a null character in this sense.

To augment your idea, though, how about you use '0xFF 0x00' as a terminator?
This way, backward-compatibility is preserved in all cases except UTF-8 =>
ASCII with NULLs, and in this case the string will be truncated rather than a
buffer overflow (i.e. "fail closed").

------
sopooneo
Can someone explain to me how UTF-8 is endianness independent? I don't mean
that I am arguing the fact, I just don't understand how it is possible. Don't
you have to know which order to interpret the bits in each byte? And isn't
that endianness?

~~~
kijin
It's endianness independent in the sense that the order in which you interpret
the _bytes_ in each _character_ does not depend on the processor architecture,
unlike UTF-16.

If your processor interprets the _bits_ in each _byte_ in a different order,
that might be a problem, but it's not what we're talking about when we usually
talk about the endianness of character encodings.

<http://en.wikipedia.org/wiki/Endianness>
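
A concrete way to see it (a sketch that prints the raw bytes of the same code
point):

    #include <cstdio>

    int main() {
        // U+00E9 ("é"):
        const unsigned char utf8[] = { 0xC3, 0xA9 };   // same two bytes on every machine
        char16_t u = 0x00E9;                           // one UTF-16 code unit...
        const unsigned char* p = reinterpret_cast<const unsigned char*>(&u);
        // ...stored as E9 00 on a little-endian CPU and 00 E9 on a big-endian
        // one. That per-character byte order is what the UTF-16 BOM has to
        // signal; UTF-8 simply doesn't have the problem.
        std::printf("UTF-8: %02X %02X   UTF-16 in memory: %02X %02X\n",
                    utf8[0], utf8[1], p[0], p[1]);
        return 0;
    }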

~~~
sopooneo
Thank you. That is very good to learn and I looked over the wikipedia article.
But as far as _byte_ order, how is that architecture independent? Is it just
that utf-8 dictates that the order of the bytes always be the same, so
whatever system you're on, you ignore its norm, and interpret bytes in the
order utf-8 tells you to?

~~~
ori_b
utf-8 is a single byte encoding. Reversing the order of a sequence that's one
byte long just gives back that one byte.

~~~
kijin
No it isn't. Any letter with an accent will take up two bytes. Most non-Latin
characters take up three bytes, sometimes even four.

~~~
ori_b
Poorly phrased. It can take multiple bytes to fully define one codepoint, but
the encoding is defined in terms of a stream of single bytes. In other words,
each unit is one byte, hence flipping each unit gives back the same unit.

This is not the case for UTF-16 and UTF-32.

~~~
ybungalobill
AFAIK the correct term is "byte oriented".

------
chj
Can't agree more! It would be a much better world if we all used UTF-8 for
external string representation. I don't care what your app uses internally,
but if it generates output, please use UTF-8.

------
CJefferson
Is there a simple set of rules for people who currently have code that uses
ASCII, to check for UTF-8 cleanness?

In particular, what should I watch out for to make an ASCII parser UTF-8
clean?

~~~
makecheck
If you're reading something in pieces, like a buffer that fills 256 bytes at a
time, you have to be careful. UTF-8 is a multi-byte encoding so the last byte
in your buffer may not completely finish a code point. Unlike older code that
can just read a bunch of bytes and use them, with multi-byte encodings you
have to have a way to deal with "left-overs" until new bytes show up.

Fortunately the UTF-8 encoding (e.g. see the Wikipedia page) makes it clear
when a byte is the beginning of a new code point, and the lead byte tells you
how many continuation bytes should follow.
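
A sketch of that lookup, which is all a buffered reader needs to decide how
many bytes to carry over to the next read:

    #include <cstddef>

    // Total length of the UTF-8 sequence that starts with this byte
    // (0 means the byte cannot start a sequence). Apply it to the tail of
    // each chunk to see whether the last sequence is complete or whether some
    // bytes must be held back until the rest arrives.
    std::size_t utf8_sequence_length(unsigned char lead) {
        if (lead < 0x80)           return 1;   // ASCII
        if ((lead & 0xE0) == 0xC0) return 2;
        if ((lead & 0xF0) == 0xE0) return 3;
        if ((lead & 0xF8) == 0xF0) return 4;
        return 0;                              // continuation byte or invalid lead
    }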

------
breck
How could we avoid acronyms like 'utf-8'?

We can do better than that. Unicode8?

~~~
paulsutter
Just use the term "string" to refer to UTF-8, and the term "data in
nonstandard encoding X" to refer to other encodings.

In the article he puts it in terms of std::string, but more generally I think
this is what he means.

~~~
adamtj
You're confusing things. Strings cannot be utf-8 any more than you can be your
signature.

"strings" are abstract data structures. They are lists of characters. Not
bytes, not integers, but characters. Often, we use the Unicode character set
as the set of allowable characters. There are other character sets.

Internally, strings often represent characters as integers. When using the
Unicode character set, strings then use the Unicode encoding to integers (a
table mapping characters to unique numbers). Sometimes we use other character
sets and encodings.

Unfortunately, integers are abstract. You can't store them in a file or
transmit them over a network until you pick a concrete representation as
bytes. How many bits per integer? Big or little endian? Etc. That's where
UTF-8 comes into play.

UTF-8 is merely a compressed data format used to represent a sequence of
integers as a sequence of bytes -- one that happens to have some properties
that make it convenient for representing strings.

UTF-8 is not Unicode.

UTF-8 can also be used for other kinds of numerical data. As a silly example,
suppose you had a list of the ages of houses. Many houses are less than 100
years old; a few are more than 300 years old. An efficient serialization of
that data would be to represent the ages as integers and then UTF-8 encode
your list of integers.
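
A sketch of that serialization (the same bit layout UTF-8 uses for code
points, just applied to plain numbers):

    #include <cstdint>
    #include <string>

    // Encode one value (0 .. 0x1FFFFF) using the UTF-8 byte layout.
    // An age like 87 fits in one byte; 1350 needs two; only the rare
    // very large values need three or four.
    std::string utf8_style_encode(std::uint32_t v) {
        std::string out;
        if (v < 0x80) {
            out += static_cast<char>(v);
        } else if (v < 0x800) {
            out += static_cast<char>(0xC0 | (v >> 6));
            out += static_cast<char>(0x80 | (v & 0x3F));
        } else if (v < 0x10000) {
            out += static_cast<char>(0xE0 | (v >> 12));
            out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (v & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (v >> 18));
            out += static_cast<char>(0x80 | ((v >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((v >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (v & 0x3F));
        }
        return out;
    }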

Some true statements: a character set is a set of characters. Characters are
not integers or bytes. A mapping from characters to integers is an encoding;
Unicode is a standard that defines a character set and such an encoding.
Mapping integers to bytes is, confusingly, also called encoding; UTF-8 is an
encoding from integers to bytes. UTF-8 is not Unicode.

~~~
paulsutter
I'm just saying that if we standardize on an encoding, we don't need to talk
about encoding. Which is my interpretation of the original document.

Separate point: there is no such thing as an abstract string or integer in a
computer, no matter what language you are using. Every string in a computer
has an encoding -- you have to store it as ones and zeros.

If we standardize on UTF-8 as the encoding, we just don't need to use the
awkward phrase "UTF-8" in ordinary conversation.

------
scoith
That page is misleading when it comes to Japanese text: UTF-8 sucks for
Japanese text. UTF-8 and UTF-16 aren't the only two choices in the whole
world, as the Japanese choice of Shift-JIS demonstrates.

~~~
ruediger
Can you elaborate on that? Why does Unicode suck for Japanese text?

~~~
byuu
Not only kanji, but also hiragana and katakana (the syllabic alphabets) encode
to three bytes per character. Shift-JIS can encode all three in two bytes, as
well as half-width katakana in one byte per character.

However, if size is such a concern (e.g. for web transmission), text
compression neutralizes the perceived benefit of region-specific encodings.

Shift-JIS' continued popularity has much more to do with change aversion than
it does technical merit.

~~~
jeffdavis
As I said above, I spoke with several Japanese people who said that some valid
characters are not representable in Unicode.

Some details can be found here: <http://en.wikipedia.org/wiki/Han_unification>

------
fleitz
tl;dr: Use UTF-8 when you need to use Unicode with legacy APIs, never anywhere
else.

UNIX isn't UTF-8 because UTF-8 is better, UNIX is UTF-8 because you can pass
UTF-8 strings to functions that expect ASCII and it kinda works. This is
really the only thing you need to know about UTF-8 and why it's better.

There are few pieces of software that don't have to talk to legacy APIs that
store strings natively in UTF-8.

C# and Java are probably the best examples of software that was engineered
from the ground up and thus uses UTF-16 internally, because it's much less
likely to run into issues like String.length returning 32 for a string that
only contains 31 characters. If you use UTF-8, expect that result any time a
string contains a real, genuine apostrophe.

"UTF-8 and UTF-32 result the same order when sorted lexicographically. UTF-16
does not."

This is complete and utter bullshit. To sort a string lexicographically you
need to decode it, and if you've decoded the string into UNICODE then they
sort the exact same way.

There are lots of gotchas when sorting UNICODE strings, including
normalization, because you can write semantically equivalent strings in
Unicode in multiple ways, e.g. ligatures.

If you're sorting bit strings that happen to contain UTF-8/32 then you're not
sorting lexicographically and your results will be screwed up anyway.

~~~
gwillen
> decoded the string into UNICODE

I think you are quite confused.

1) Unicode is not an acronym.

2) You cannot "decode into Unicode". I think you mean "decode into
codepoints".

3) If that is what you mean, then you are wrong about sorting: sorting UTF-8
and UTF-32 bytestrings will indeed sort them lexicographically by code point,
which was the author's point. No, that will not generally be the sort you
_want_; but no amount of 'decoding' will give you the sort you want. For that
you need to first normalize, and then follow the collation rules, which don't
sort by raw code points at all.
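
To see it, compare one code point inside the BMP against one just past it (a
sketch; the comments give the results):

    #include <string>
    #include <iostream>

    int main() {
        // U+FFFD (in the BMP) vs U+10000 (outside it). By code point, U+FFFD < U+10000.
        std::string    a8  = "\xEF\xBF\xBD",  b8  = "\xF0\x90\x80\x80"; // UTF-8
        std::u16string a16 = u"\uFFFD",       b16 = u"\U00010000";      // UTF-16
        std::u32string a32 = U"\uFFFD",       b32 = U"\U00010000";      // UTF-32

        std::cout << (a8  < b8)  << '\n';  // 1: UTF-8 byte order matches code point order
        std::cout << (a32 < b32) << '\n';  // 1: so does UTF-32
        std::cout << (a16 < b16) << '\n';  // 0: UTF-16 code unit order does not, because
                                           //    U+10000 starts with the surrogate 0xD800,
                                           //    which sorts below 0xFFFD
        return 0;
    }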

------
alecco
ASCII and UTF-8 are too US-centric. That's why adoption in places like China
is so low.

Also, if we're using a variable-length encoding anyway, why can't we do it
properly and improve size for the same computational cost?

~~~
lambda
Did you read the article, including the part about Asian text? Like it or not,
most text these days is embedded in markup languages like XML or HTML, in
which all of the markup is within the ASCII range. This, coupled with the fact
that UTF-8 gives you a factor of 2 savings over UTF-16 for the ASCII range,
while only a factor of 1.5 increase over UTF-16 for CJK characters, means that
for much text (such as anything on the Web), UTF-8 is actually smaller than
UTF-16 even for CJK text.

Yes, ASCII is obviously too US centric; you can't encode any writing systems
other than the Roman alphabet in ASCII. However, that's not at question here.
The question is, which Unicode encoding should you use, so you can represent
all writing systems in a single encoding. And the major contenders are UTF-8
and UTF-16. The point of this article is, for that purpose, UTF-8 is a far
better choice.

> Also, if there's variable length encoding why can't we just do a proper way
> and improve size for the same computational cost?

What do you mean by a "proper way"? If size is what you care about, just
compress your data. Compression will do a lot better for a much wider range of
data than some clever encoding will. UTF-8 is a carefully constructed encoding
designed to meet several design criteria. For instance, you could get better
size for a wider range of character sets by having a single byte to represent
switching between character sets; so you could use that byte, and then a whole
bunch of 2 byte CJK characters. But that would defeat one of the design goals
of UTF-8, which is to be self synchronizing. That means that if you get a
partial sequence (such as a sequence that has been truncated), you can start
decoding the characters after a fixed number of bytes. In the case of UTF-8,
you will never have to go more than 3 bytes before you can start decoding
again. In my hypothetical scheme where certain symbols were used to switch
between character sets, you would not be able to interpret anything until you
found the next such symbol. This makes UTF-8 more robust in the face of
errors.

Another design goal of UTF-8 was to be backwards-compatible with ASCII. Like
it or not, ASCII has been the standard encoding for decades, and there is a
lot of text in ASCII and a lot of software that uses ASCII delimiters and the
like.

So, while it would be possible, in theory, to define a character encoding that
is more "fair" than UTF-8, that ignores many of the other goals of the design
of UTF-8. And UTF-8 is widely supported and used (it is the most popular
encoding on the Web, even in places like Japan, and a close second in China),
while a new encoding would require another large, global, and painful
transition process to introduce.

~~~
muyuu
The author compares UTF-8 to UTF-16, while there are a myriad of encodings
better than both for various Asian languages.

For instance: EUC-JP in Japan, BIG5 in Taiwan, GB in China, etc. The various
EUC encodings are variable-length and a lot more efficient, since they put the
common subsets for each language in the lower parts of the table, close to
each other, so they also compress better, while allowing tricks for text
matching and searches (not really necessary for web sites, but it's sometimes
nice to use the same encoding throughout an application). Russian and Greek
text also basically doubles in size.

There are a lot of other considerations.

If you think a 30%+ saving in size (and latency) is not a big deal, then
you're a lot more likely to lose to local competitors. Note that gzipped or
otherwise compressed text makes the difference even worse, at least in the
case of Japanese, where UTF-8 text gets de-aligned all over the place to odd
byte sizes and compresses worse. Add to that the fact that Asians browse the
net A LOT from their phones, and have done so for much longer and in a bigger
percentage than Westerners, and the problem is exacerbated even further.

There is a lot more to consider and like it or not it's not as simple as
"UTF-8 for everything and everybody, ever!"

~~~
kijin
Have you ever worked with a system that needs to deal with more than one
language at the same time? What if your users want to mix Japanese with
Russian in the same sentence? Or Japanese and simplified Chinese? (Yes, people
do that.)

In the global Internet, UTF-16 and UTF-8 are the only games in town.

~~~
muyuu
All the damn time I'm using several languages.

Then UTF (and EUC's) are the way to go.

It's not like you have to use the same encoding all the time.

~~~
derleth
> It's not like you have to use the same encoding all the time.

Then you are going to feed someone garbage. Why feed people garbage?

~~~
muyuu
??

Not if you know what you're doing. Not any more than using utf8 exclusively
all the time and for all purposes.

~~~
derleth
> Not if you know what you're doing.

This is nice in theory. In practice, people make mistakes. Make it easy on
yourself.

> Not any more than using utf8 exclusively all the time and for all purposes.

Maybe I was unclear: Feeding me Chinese text in UTF-8 is _not_ garbage.
Feeding me _anything_ in one of the GB encodings _is_ garbage.

Garbage, to me, is text in an encoding I can't handle. If you only use UTF-8,
that cannot possibly happen.

------
natch
Strings (NSString) on Apple platforms are UTF-16. The Apple platforms are not
exactly lagging behind in either multilingual, or text processing. I wonder
what this team of three people knows that Apple doesn't? Or is it the other
way around, that Apple knows something they don't, and when it comes to
shipping products that work in the real world, Apple has figured out how to do
it?

~~~
__david__
NSStrings are opaque--you always call accessor functions and never have access
to the low level backing store. The reason they are good is that you can't get
data into or out of them without specifying an encoding, which leaves the
actual encoding of the backing store as an implementation detail.

The fact is, I don't even _know_ (or see documented) that the backing _is_
UTF-16--Apple is free to change that at their whim and no user programs would
break.

~~~
brigade
It's not documented (presumably) for that very reason.

In fact, the opposite is implied by
initWithBytesNoCopy:length:encoding:freeWhenDone: - it should be possible
_right now_ to have NSStrings with arbitrary internal representations, even if
most other creation methods currently convert to UTF16.

