The UTF-8-Everywhere Manifesto (utf8everywhere.org)
381 points by bearpool on Apr 29, 2012 | 182 comments



Really good article. You'll get nothing from me but heartfelt agreement. I especially liked that the article gave numbers on how inefficient UTF8 supposedly is for storing Asian text (not very, apparently).

Also insightful, but obvious in hindsight: not even in UTF-32 can you index a specific character in constant time, due to combining characters and the like.

The one property I really love about UTF8 is that you get a free consistency check as not every arbitrary byte sequence is a valid UTF8 string.

This is a great help for detecting encoding errors early (to this day, applications are known to lie about the encoding of their output).

And of course, there's no endianness issue, removing the need for a BOM, which makes it possible for tools that operate at the byte level to still do the right thing.

If only it had better support outside of Unix.

For example, try opening a UTF8-encoded CSV file (using characters outside of ASCII, of course) in Mac Excel (in the latest versions; earlier versions didn't know UTF8 at all) for a WTF experience somewhere between comical and painful.

If there is one thing I could criticize about UTF8, it would be its similarity to ASCII (which is also its greatest strength): it causes many applications and APIs to boldly declare UTF8 compatibility when all they can really do is ASCII, and to emit a mess (or blow up) once they have to deal with code points outside that range.

I jokingly call this US-UTF8 when I encounter it (all too often, unfortunately), but the proliferation of "cool" characters like the Emoji we recently got will probably help with this over time.


"The one property I really love about UTF8 is that you get a free consistency check as not every arbitrary byte sequence is a valid UTF8 string."

You don't get this at all using UTF-8. You only get it if you attempt to decode the string which even something like strlen doesn't do. Strlen will happily give you wrong answers about how many characters are in a UTF-8 string all day long and never ever attempt to check the validity of the string. Take your valid UTF-8 and change one of the characters to null, now it doesn't work in many circumstances with 'UTF-8' code.

Also, should the free consistency check ever actually work, you're in a bigger pickle, as you now have to figure out whether the string is wrongly encoded UTF-8 or someone sent you extended ASCII.

I did a lot of work with unicode apps. I used to have a series of about 5 strings that I could paste into a 'UNICODE' application and have it invariably break.

One was an extended ASCII string that happened to be valid UTF-8 sans BOM :)

One was a UTF-8 string with BOM and has 0x00 inside :) (I call this string how to tell if it was written with C)

One was a UTF-8 string with a BOM :)

One was a UTF-8 string with some common Latin characters, a couple of Japanese ones, and a character outside the BMP.

Two were UTF-16 strings in LE/BE, with and without a BOM.


>> "The one property I really love about UTF8 is that you get a free consistency check as not every arbitrary byte sequence is a valid UTF8 string."

>You don't get this at all using UTF-8. You only get it if you attempt to decode the string which even something like strlen doesn't do.

I wasn't talking about using strlen (aside from when I was jokingly talking about US-UTF8, where I've seen instances of strlen() being used against UTF-8 strings). I was talking about using library functions designed to handle UTF-8 encoded character data (which strlen() and friends are not).

What I meant with "free consistency check" was that any library function that is designed to deal with UTF-8 data is by default put into a position where it can quite safely determine whether the input data given to it is in fact in UTF-8 or not.

This is not true for any other character encoding I know of (I don't know about the legacy 2-byte Asian encodings at all).

In legacy 8-bit character sets, there's nothing you can do to check whether you have been lied to, aside from analyzing the content, trying to guess the language, and mapping that to the occurrence of characters in the character set you have been told the string is in (pretty much unfeasible).

With UTF-16 you can at least use some heuristics if you are dealing with common English text (every second byte would be 0), but you can't be sure - especially not when the text consists primarily of non-ASCII characters.

Only with UTF-8 can you take one look at the input data and determine with quite a bit of confidence whether the data you have just been handed is in fact UTF-8 or not (it might still be pure ASCII, but that still qualifies as UTF-8).

If you ever get lied to and somebody tries to feed you ISO-8859-1 claiming it to be UTF-8 (happens all the f'ing time to me), then any library or application designed to deal with UTF-8 can immediately detect this and blow up before you store that data without any way to ever find out what encoding it would have been in.
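To make that concrete: the structural rules are simple enough that the whole check fits in a screenful of C. This is just an illustrative sketch (not code from the article or from any particular library); a real validator enforces the same rules, namely correct continuation bytes, no overlong forms, no surrogates, nothing above U+10FFFF.

    #include <stdbool.h>
    #include <stddef.h>

    /* Sketch: returns true only if buf[0..len) is well-formed UTF-8. */
    static bool looks_like_utf8(const unsigned char *buf, size_t len)
    {
        for (size_t i = 0; i < len; ) {
            unsigned char b = buf[i];
            size_t n;            /* number of continuation bytes expected */
            unsigned long cp;    /* decoded code point */

            if (b < 0x80) { i++; continue; }              /* plain ASCII */
            else if ((b & 0xE0) == 0xC0) { n = 1; cp = b & 0x1F; }
            else if ((b & 0xF0) == 0xE0) { n = 2; cp = b & 0x0F; }
            else if ((b & 0xF8) == 0xF0) { n = 3; cp = b & 0x07; }
            else return false;   /* stray continuation byte or 0xF8..0xFF */

            if (len - i - 1 < n) return false;            /* truncated sequence */
            for (size_t k = 1; k <= n; k++) {
                if ((buf[i + k] & 0xC0) != 0x80) return false;
                cp = (cp << 6) | (buf[i + k] & 0x3F);
            }
            /* reject overlong forms, UTF-16 surrogates and out-of-range values */
            if ((n == 1 && cp < 0x80) || (n == 2 && cp < 0x800) ||
                (n == 3 && cp < 0x10000) ||
                (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF)
                return false;
            i += n + 1;
        }
        return true;
    }

Feeding typical ISO-8859-1 text through this (say, a stray 0xE9 for "é" followed by a space) fails immediately, which is exactly the early blow-up described above.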


"One was an extended ASCII string that happend to be valid UTF-8 sans BOM :)"

Do you mean you pasted in a string of bytes that was valid UTF-8 into an app expecting UTF-8, and it didn't decide to convert it into ISO 8859-something based on some heuristic?

Sounds like correct behavior to me.


> You only get it if you attempt to decode the string which even something like strlen doesn't do.

Because strlen() is a count of chars in a null-terminated char[], not a decoder. Ever. It's character set agnostic.

> Strlen will happily give you wrong answers about how many characters are in a UTF-8 string all day long and never ever attempt to check the validity of the string.

Because, again, strlen() counts chars in a null-terminated char[]. It is giving you the right answer, you are asking it the wrong question.

> Take your valid UTF-8 and change one of the characters to null, now it doesn't work in many circumstances with 'UTF-8' code.

Which means it's not a valid UTF-8 decoder, but is instead treating the buffer as Modified UTF-8[1].

> that I could paste into a 'UNICODE' application

Clipboards or pasteboards in many operating systems butcher character sets when copying and pasting text. Generally, the clipboard cannot be trusted to do the right thing in every circumstance. On Windows in particular, the character set can get transposed to the system character set or something rather arbitrary when text is copied.

> One was a UTF-8 string with BOM and has 0x00 inside :) (I call this string how to tell if it was written with C)

> One was a UTF-8 string with a BOM :)

Don't use the BOM[2] in UTF-8. It's recommended against.

So really, your point is that some implementations are bad, and you have a bag of tricks for breaking implementations that don't handle all corner cases? That's pretty universal even in the non-Unicode world; there's bad implementations of everything. Windows is an especially bad implementation of most things Unicode.

A valid decoder will, indeed, consistency-check an arbitrary string of bytes as UTF-8. The OP is correct, and your corner cases don't refute his point.

[1]: http://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8

[2]: http://en.wikipedia.org/wiki/UTF-8#Byte_order_mark


A name like strlen suggests that it's designed to take the length of a string; if it were called count_null_ter_char_array then I'd tend to believe you. It's not character set agnostic, it's monotheistic at the shrine of ASCII; it's all over the coding style.

Null is valid UTF-8, it just doesn't work with C 'strings'. I can get null out of a UTF-8 encoder with no problem.

My point is that UTF-8 is nowhere near the panacea being described and if you have to touch the strings themselves that it's far better to use UTF-16 in the vast majority of cases. The only time you ever really want to use UTF-8 is if you're dealing with legacy codebases, it's a massive hack.


I do not understand how UTF-16 could be better for this reason. wcslen works exactly like strlen but on wide chars instead of chars.


> A name like strlen suggests that it's designed to take the length of a string

It is. A "string" in C is a char[] (there is no "string" type). A char is a type that is a number. That number has no meaning aside from being a number. Conveniently, you can assign a char like so:

    char foo = 'b';
That sets the variable foo, of type char, to the value 98. That the 98 means anything, in particular, the letter 'b' in many character sets, is a complete accident and completely orthogonal to char's purpose. A "string" in C is a collection of chars. That is all. No encoding (especially not "the shrine of ASCII"), no purpose beyond being an array of numbers that end in 0, just a bunch of numbers.

You are misunderstanding "strings" in C, and by extension, strlen(). This is not a problem with UTF-8. This is a problem with you misunderstanding the C library and basic types. If you don't believe me (I'm right, but, your call), you can certainly download the C99 spec and investigate what a "char" is, what a "string" is (hint: there isn't such a thing at all), and what "strlen()" is designed to be.

Here's a simple, naive strlen():

    size_t strlen(const char *string) {
        const char *p = string;
        while(*p) p++;         /* walk forward until the terminating 0 byte */
        return p - string;     /* count of chars before that 0 */
    }
That's it. No "monotheism at the shrine of ASCII". It counts chars until it finds 0. It is giving you the right answer. That you don't understand the answer is not UTF-8's (or C's) problem at all. Now, if you want to talk about printf(), I'm listening -- because you might be able to conjure up a point there -- but you are not talking about printf(). This, and other comments, are way off-base on how strlen() works.
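As an aside, counting code points instead of chars in a buffer that is assumed to already be valid UTF-8 is only a few lines more, because continuation bytes are trivially recognizable. This is an illustrative sketch, not a function from any particular library:

    #include <stddef.h>

    /* Sketch: count code points in a NUL-terminated, valid UTF-8 string by
       counting every byte that is NOT a continuation byte (10xxxxxx). */
    size_t utf8_codepoint_count(const char *s)
    {
        size_t count = 0;
        for (; *s; s++) {
            if (((unsigned char)*s & 0xC0) != 0x80)
                count++;    /* ASCII byte or lead byte starts a new code point */
        }
        return count;
    }

It still stops at the first 0 byte, of course; that is a property of C strings, not of any encoding.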

> Null is valid UTF-8, it just doesn't work with C 'strings'.

Sure it does! I can store a null in a char[] all day long. That just changes its behavior when passed to something that counts the length of a char[] before a terminating null (like, wait for it, strlen()). Watch!

    char buf[8] = "abcde\0f";
What we have here is a buffer of length 8, which contains these char values:

    97 98 99 100 101 0 102 0
Now, strlen(buf) is 5. That's because that's what strlen is designed to do. The actual length of the buffer is, amazingly, still eight, and if your code expects to work with all eight chars in the char[], then by golly, it can.

If you are using strlen() with any expectation of character set awareness or human alphabet behavior, you completely misunderstand the purpose of strlen().

Since you're so adamant that UTF-16 is better (but you completely misunderstand how C's typing works), I'm less inclined to accept your opinion on UTF-8 being a "massive hack". Explain to me what strlen() on a buffer containing a UTF-16 string does -- and, why that's better -- and I might come around.


Yea, reminds me of DBCS. UTF-8, however, doesn't use bytes below 0x80 as anything other than an ASCII character, unlike some DBCS encodings such as Shift-JIS.


Ok, let me be the first approving top level comment: This document is correct. The author of this document is smart. You should follow this document.

As jwz said about backups: "Shut up. I know things. You will listen to me. Do it anyway."


Yes! I have been meaning to write something like this for years.

There is only one thing I would add: never add a BOM to a UTF-8 file!! It is redundant, useless, and breaks all kinds of things by attaching garbage to the start of your files.

Edit: Here is the interesting story of how Ken Thompson invented UTF-8: http://doc.cat-v.org/bell_labs/utf-8_history


The mark isn't useless; it clearly identifies files as UTF-8 so they can be processed as such immediately. Otherwise a program has to "sniff" several bytes to see if the encoding could be something different, and it may not guess correctly.

Also, how can "all kinds of things" break with this mark? If something is reading UTF-8 correctly then it'll be fine with the mark; and if it's not reading UTF-8 correctly then it will screw up a lot more than the mark at the beginning of the file.


This argument is silly. Why not prefix every UTF-8 string with a BOM then? It's wasteful and unnecessary, because UTF-8's clean structure already makes it trivial to detect, and false positives are all but impossible for real-world text. There's a paper out there that proves this.

The UTF-8 BOM was a Microsoft invention. Nobody else uses it, and it breaks tons of things. Two examples off the top of my head: Unix hashbang scripts (i.e. #!/bin/bash), and PHP scripts (the BOM will trigger HTTP header finalization before any code is run).


Not knowing a lot about encodings (my bad), the BOM cost me a lot of hours of debugging Ajax calls in a PHP app.


You wouldn't prefix every string with it because presumably your API or program's state has already determined the string's encoding. I am not suggesting that every fragment of text has to be explicit (I agree that would be ridiculous). I am only stating facts: there is nothing incorrect about having the mark, a conformant reader must be able to handle the mark, and the mark has some value as a short-cut for avoiding elaborate decoding tricks.


The thing is, the BOM is metadata, it doesn't belong in content. It violates the contract of .txt files, which is: the entire file is a single string of content.

Recognizing it at the edges of your program and stripping it out is not the end of the world, but it's annoying and no other (8-bit) encoding works that way. In fact, I find it hard to believe UTF-8 BOMs in MS programs were anything more than a programmer error. Once such files were out in the wild, everyone else had to deal with them.


There are already plenty of cases that valid UTF-8 readers have to deal with (unused ranges of code points, invalid byte combinations, etc.). Ignoring a BOM is trivial by comparison. A UTF-8 reader honestly doesn't care about the "stringness" of a .txt file because of all the other crap that can be in a byte stream.

Older programs do care, but as I've said elsewhere in the thread an ASCII file can remain ASCII (no BOM). There's no reason to BOM-ify an old ASCII file if it really is ASCII and only ASCII-expecting programs will ever use it.

Over time these old programs will either be upgraded or go away and it will finally be safe to say that inputs must be UTF-8. At that time, the BOM has no reason to exist.


A lot of software predates UTF-8.

For example, scripts on Unix-like (Unix, Linux, BSD) systems. In order to determine that a file should be executed using an interpreter, they look for the ASCII bytes "#!" at the beginning of the file. If those bytes are not present (such as if the file is UTF-8 with a BOM), then it won't be executed with an appropriate interpreter.

Now, once the interpreter is found, it can interpret the rest of the text as it sees fit (for instance, interpreting it as proper UTF-8). But the first two characters of the file must be "#!".

Furthermore, the BOM is simply a bad idea. It was poorly conceived from the beginning. It means something different at the beginning of a file than it does in the middle (at the beginning, it is the BOM; in the middle, it is a zero-width non-breaking space). This means that you can't simply append files or strings that begin with the BOM, and have the result still be valid.

The BOM should not be used for sniffing files, except as a fallback. Sniffing is a terrible way to tell what encoding a file is in. The format should be recorded somewhere (in file metadata, by default, by the locale, or something of the sort). Sniffing should only be used as a last resort.
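For reference, the BOM (U+FEFF) serializes in UTF-8 as the three bytes EF BB BF, so tolerating it at the edge of a program costs only a few lines. An illustrative sketch, with a made-up helper name:

    #include <stddef.h>
    #include <string.h>

    /* Sketch: skip a leading UTF-8 BOM (EF BB BF) if present; returns a
       pointer just past it, or the original pointer if there is none. */
    const char *skip_utf8_bom(const char *buf, size_t len)
    {
        if (len >= 3 && memcmp(buf, "\xEF\xBB\xBF", 3) == 0)
            return buf + 3;
        return buf;
    }

It also makes the hashbang problem obvious: the kernel looks at the first two bytes of the file, and EF BB BF is not "#!".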


"Sniffing is a terrible way to tell what encoding a file is in."

In theory yes, but in practice it's an awesome way to find out, because the sniffer is more accurate than the web server, mail client, human, etc. Just about every piece of software I've written that has to deal with multiple encodings uses a sniffer; even if the format specifies the encoding, it's a good idea to check with a sniffer.


The BOM is in-band signalling. It breaks the ASCII backwards-compatibility of UTF8.

BOMs are only necessary where the provenance of data is not known. Normally, there is a context provided which will determine the encoding.

Basically, yes - text should be tagged (unless 'utf8 everywhere' wins) but imho the tagging should be external to the content.


I agree with the use of context, yes; if you already know your input is UTF-8 (e.g. C strings in a program API or a protocol or whatever) there's no point in adding an extra specifier.

If a program requires ASCII compatibility in order to work then by all means make the input files ASCII (no BOM), just make sure the files have no true UTF-8 dependencies in them.

Once a program supports UTF-8 "properly" however, the BOM is useful as a signal that the input is somewhat complicated.

At some point in the future when UTF-8 really is everywhere and programs may no longer even try to sniff encodings, etc. then yes, the BOM has no real reason to exist.


But you have the same problem with the text/binary distinction. On some platforms, no distinction between them needs to be made (quite usefully, adding to tool simplicity and conceptual simplicity).

If you use BOM, a tool which can operate on text or binary must be told which it is operating on. This would have to be done via some external context (e.g. cmdline switch). And that would never go away, even in a utf8 everywhere world.

Basically, in BOM-world: - You either need to tag each fragment of text or you still need to use external context (e.g. which encoding do I get text columns from my database) - You perpetually need to differentiate between binary and text data for all tools which do nothing more complicated than read and write

and in non-BOM-world you:

- add to the contextual clues you need anyway something like "any files which you are going to interpret as text on the system should be interpreted as utf8"
- when moving data on or off the local system, use a network protocol which supports tagging the text payload (e.g. email, http).

The problems arise mostly with file shares (or their equivalent, version control systems) where text files are exchanged without an accompanying protocol. That is where "BOM world" or "UTF8 world" will ultimately have to settle their differences.

BOM-world would like all systems, everywhere, to make a text/binary distinction for ever. UTF8-world would like to say that textual data lacking a context should be interpreted as UTF8. But feel free to use UTF16/UTF32 for specific purposes or systems.


What on earth does "processed as such" mean? UTF-8 can be "processed" anywhere ASCII can, that's the whole point. The only point to the BOM is to distinguish it from UTF-16 (or UCS2, which is usually what UTF-16 degenerates into). And UTF-16 is broken garbage and shouldn't be used.

Given that, your last sentence is basically an ode to complexity. Taken to the logical conclusion you'd support any rule, no matter how ridiculous, as long as it is agreed upon as "correct" by ... someone.

But I don't agree: the BOM hurts, it doesn't help.


No, UTF-8 cannot be processed in exactly the same way as ASCII! It is highly compatible with ASCII-assuming environments because it can accept the same data (e.g. it remains a stream of bytes that doesn't have to be converted to some "fat" integer) and any UTF-8 text that just happens to contain only ASCII characters will work just fine with an old ASCII program. But complex multi-byte UTF-8 inputs will not work without special treatment.

Historically, code pages mapped character values directly to individual bytes: typically redefining the upper half (128-255) while leaving the lower numbers the same as ASCII. Programs could display a wide variety of text encodings correctly as long as they knew what the encoding was, and all they had to do was read bytes individually. These encodings could all be handled in a way largely similar to ASCII.

UTF-8 however is a multi-byte encoding, which means you could have (say) 3 bytes that combine to form a symbol. Not only that, but in certain forms multiple symbols could imply display as a single glyph (e.g. an accent followed by a letter). A program that does everything the old ASCII way would choke on multi-byte UTF-8, despite being otherwise-compatible. Consider a program with a fixed-size read buffer; if the last byte is only partway through a multi-byte character, that character will be mishandled unless the program knows how to preserve those bytes and "complete" the character when more bytes arrive.
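A sketch of the bookkeeping that implies, assuming the buffer is otherwise valid UTF-8 (illustrative only, names made up):

    #include <stddef.h>

    /* Sketch: how many bytes at the end of buf[0..len) belong to an
       incomplete UTF-8 sequence and must be held back until more data
       arrives. Assumes the buffer is otherwise well-formed. */
    size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
    {
        if (len == 0) return 0;
        size_t i = len;
        /* back up over at most three continuation bytes (10xxxxxx) */
        while (i > 0 && len - i < 3 && (buf[i - 1] & 0xC0) == 0x80)
            i--;
        if (i == 0) return 0;
        unsigned char lead = buf[i - 1];
        size_t need;    /* total sequence length announced by the lead byte */
        if (lead < 0x80)                need = 1;
        else if ((lead & 0xE0) == 0xC0) need = 2;
        else if ((lead & 0xF0) == 0xE0) need = 3;
        else if ((lead & 0xF8) == 0xF0) need = 4;
        else return 0;  /* invalid lead byte: let the decoder report it */
        size_t have = len - (i - 1);
        return have < need ? have : 0;    /* bytes to carry into the next read */
    }

Whatever this returns gets copied to the front of the buffer before the next read fills in the rest.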


You're thinking from the point of view of a display program that needs to split strings into glyphs. That's a fine application, but very rare. And yes, it's inherently encoding dependent and tends to like wide characters instead of multibyte ones.

But introducing a BOM for the sake of that application is a disaster, because it hurts everything else. You can (literally) feed UTF-8 to parsers written 30 years ago and apply all your existing intuition about string handling in C without worry. Unless you deliberately break it by including a binary, non encoding garbage furball at the front of your "file" (and good luck figuring out what a "file" should mean in a OS metaphor designed around streams).


If a file really is pure ASCII, leave it that way. I am not suggesting to do otherwise. If a 30-year-old program only deals with ASCII then make sure your input looks like ASCII.

But if your input could contain complex UTF-8 (e.g. it's multi-language or whatever), you're not doing any favors by hiding this fact. The BOM is a quick way to know exactly what the file is, and it shows you that your program won't work with that input. So you translate the input or you fix the program.

At some point in the future the majority of programs will handle even complex UTF-8 properly, and then the BOM will be pointless because virtually all inputs will be UTF-8.


> Q: What do you think about BOMs?

> A: Another reason not to use UTF-16. UTF-8 has a BOM too, even though byte order is not an issue in this encoding. This is to manifest that this is a UTF-8 stream. If UTF-8 remains the only popular encoding (as it already is in the internet world), the BOM becomes redundant. In practice, many UTF-8 text files omit BOMs today.

That's one of the central points of the article --- "sniffing" is completely unnecessary if you can assume a UTF-8 encoding type.

You're right that the mark isn't useless; we want it to be useless.


    cat a b c
There you go, a BOM in the middle of the file.


If the BOM appears in the middle of the file then it should be considered a zero-width non-breaking space (so basically ignored), so why is this an issue?


Here is an issue (one that has bitten me and took a lot of blood and sweat to track down):

    cat a b c | grep '^foo'


Only if it's treated as human-readable text. But a great many text files are not, in fact, purely meant as human-readable text.


A Unicode-compatible cat(1) would strip the BOM from all but the first file.

Applying non-Unicode compatible utility to Unicode files of course doesn't work.


Cat doesn't know and doesn't need to know what kind of input it gets; it's just for concatenating files. The files themselves can be binary for all it matters (indeed, it was used very often to concatenate tape devices to be used by tar further down in the pipe).


Agreed. The user of "cat" is responsible if they've created a stream that is invalid input to some other program. Unix philosophy is for programs to do (ideally) one simple thing well. If the contents of any one file cannot be used directly, then the idea would be to run some other filter program first (e.g. one that strips bytes from the beginning) prior to using "cat" and piping the results somewhere else.

In practice though some programs and protocols have text/binary distinction (e.g. FTP). It's not unreasonable to have a mode that hints when bytes are to be used as text. This is frequently done to handle different new-line styles for instance.


...and you think this is an argument in favor of including a BOM?


I prefer to solve the problem in the right place.

Raw concatenation of bytes without encoding-awareness introduces the possibility that the bytes will combine in unexpected ways. The presence of something as obvious as a BOM makes it harder to make this mistake, at least during the transition phase to "UTF-8 everywhere".

What you really want in this situation is something that forces you to see the potential bug and introduce the correct translation and/or text-concatenation tools to fix it.

And yes, at some point in the future enough tools will be truly aware of UTF-8 that the BOM will not have a reason to exist. But right now it has some value.


So, first you argue that UTF-8 without BOM might split glyphs at incorrect places, and then it's pointed out that UTF-8 with BOM concatenates files incorrectly, and somehow this is also an argument for using BOM?

I'm sorry but it's obvious both approaches have rather symmetrical opportunities to produce bugs. So in that case the best choice seems to be the simplest one. Which is the one without BOM.


Applying non-Unicode-compatible utilities to UTF-8 files works fine; that's why UTF-8 was invented. Adding a byte-order mark to a UTF-8 file is what causes non-Unicode-compatible utilities to stop working.


What would you do differently if the mark was there versus if it wasn't?

ASCII is a perfect subset of UTF-8, so any operations you would do on UTF-8 are also operations you would do on ASCII. The behavior of your program wouldn't change. The BOM is a no-op when it comes to how your program handles things.

However, it has the potential of confusing older programs. It makes things that should be simple (like 'cat') need to be encoding aware, and modal. It means that streaming multiple files one after the other breaks.

So, in summary, the BOM doesn't buy you anything, or give you any important information about a file. It does make things harder.

It should die.


The BOM is neither required nor recommended in UTF-8. Do not put a BOM in your UTF-8 documents. (Accept a BOM in other people's UTF-8 documents, because Postel's Law, and because other people are dumb. But don't perpetuate it...)


Just assume that it is UTF-8, no sniffing needed. If it's not, then convert it to UTF-8.


Sadly, the pervasiveness of JavaScript means that UTF-16 interoperability will be needed at least as long as the Web is alive. JavaScript strings are fundamentally UTF-16. This is why we've tentatively decided to go with UTF-16 in Servo (the experimental browser engine) -- converting to UTF-8 every time text needed to go through the layout engine would kill us in benchmarks.

For new APIs in which legacy interoperability isn't needed, I completely approve of this document.


Yeah, it's really sad the number of legacy APIs which have standardized on UTF-16.

The Windows API calls UTF-16 "Unicode". Most Mac OS X APIs use UTF-16. JavaScript and Java both use UTF-16. ICU uses UTF-16. So while UTF-8 is technically superior in almost every way, it's going to be an uphill battle to standardize on it.

I appreciate that new languages like Rust and Go made the choice of UTF-8 as their native text encoding. But there's a lot of inertia for UTF-16, and I'm not sure it'll be easy to ever get free of it.


OS X APIs generally use NSString/CFString, which hide the actual encoding of the string; they can be any encoding at all internally.


While they could, in theory, hide the actual encoding, the APIs all refer to UTF-16 code units as "characters;" so while they could use UTF-8 as the internal encoding, you need to use and understand UTF-16 in order to interact with them properly. When you ask for the "length" of a string, you are told the number of UTF-16 code units. When you get a character at an index, you get the UTF-16 code unit. That's what I mean when I say the APIs use UTF-16; everything in the API that deals with individual "characters" is actually referring to UTF-16 code units.

The same is true of JavaScript; while you could technically implement the strings however you want, the APIs are all oriented around UTF-16 code units. And the Windows API as well, is all built around UTF-16 code units.

The problem with all of these APIs is that they make the mistake of conflating characters and code units. They all make the assumption that a character consists of a single, fixed width integer, of some given size (16 bits in the case of UTF-16). It is better to distinguish between indexing in code units (such as bytes in UTF-8 or 16 bit integers in UTF-16) and indexing in code points, or glyphs, or whatever higher level concept you are talking about. Really, for anything higher than the code unit level, you should be dealing with variable-length strings, and not try to force that into fixed length units. With UTF-8, there's no temptation to treat a single code unit as being an independently meaningful entity, as that assumption breaks down as soon as you get past the ASCII range; while with UTF-16, it's easy to make that mistake, since it holds true for everything in the Basic Multilingual Plane, which contains most characters you're likely to encounter on a day to day basis.


Some old languages have also made that choice, albeit recently in Python's case:

http://www.python.org/dev/peps/pep-3120/


However, when it comes to the internal representation of text, things are quite complex as of Python 3.2: http://www.python.org/dev/peps/pep-0393/


ICU has some UTF-8 functionality. But ICU is really ugly.

http://userguide.icu-project.org/strings/utf-8


As long as you don’t care about errors, converting to UTF-8 is quite fast. Just use native calls:

  var decode = function (bytes) {
    return decodeURIComponent(escape(bytes));
  }
  var encode = function (string) {
    return unescape(encodeURIComponent(string));
  }
(Definitely test it before wailing about benchmarks. My guess is that whatever else you’re doing is likely much slower.)

If you do care about errors, or especially if you need to deal w/ UTF-8 streams that might be chopped mid-character, use something like https://github.com/gameclosure/js.io/blob/master/packages/st...


I think you misunderstand -- I'm referring to the actual systems-level implementation of the browser engine itself. I'm not talking about the implementation of web apps.

Consider a pattern like this: A page calls document.createElement(), adds a large text node (say, the collected works of Shakespeare in text form) to it, calls window.getComputedStyle() on that element, then throws the element away. This series of DOM manipulations must go through the layout engine. If the layout engine knows only UTF-8, then the layout engine has to convert the collected works of William Shakespeare from UTF-16 to UTF-8 for no reason (as it needs an up-to-date DOM to perform CSS selector matching for the getComputedStyle() call). There is no reason to do that when it could just use UTF-16 instead and save itself the trouble.


What part of that process requires UTF-16? JavaScript doesn't require UTF-16; it just requires Unicode. You could use UTF-8 in your JavaScript implementation as well.


> What part of that process requires UTF-16? JavaScript doesn't require UTF-16; it just requires Unicode. You could use UTF-8 in your JavaScript implementation as well.

Actually, JavaScript _does_ require UTF-16. From the ES5.1 spec:

> A conforming implementation of this Standard shall interpret characters in conformance with the Unicode Standard, Version 3.0 or later and ISO/IEC 10646-1 with either UCS-2 or UTF-16 as the adopted encoding form, implementation level 3. If the adopted ISO/IEC 10646-1 subset is not otherwise specified, it is presumed to be the BMP subset, collection 300. If the adopted encoding form is not otherwise specified, it presumed to be the UTF-16 encoding form.


JS does require UTF-16, because the surrogate pairs of non-BMP characters are separable in JS strings.

    '<non-BMP character>'.length == 2
    '<non-BMP character>'[0] == first codepoint in the surrogate pair
    '<non-BMP character>'[1] == second codepoint in the surrogate pair

Any JS implementation using UTF-8 would have to convert to UTF-16 for proper answers to .length and array indexing on strings.
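The arithmetic behind that length of 2 is small enough to sketch; the constants are the ones UTF-16 defines, the program itself is just an illustration (in C rather than JS):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t cp = 0x1F600;              /* a non-BMP code point */
        uint32_t v  = cp - 0x10000;         /* 20 bits left to split */
        uint32_t hi = 0xD800 + (v >> 10);   /* high (lead) surrogate  */
        uint32_t lo = 0xDC00 + (v & 0x3FF); /* low (trail) surrogate  */
        /* prints: U+1F600 -> 0xD83D 0xDE00, i.e. two JS "characters" */
        printf("U+%04X -> 0x%04X 0x%04X\n", (unsigned)cp, (unsigned)hi, (unsigned)lo);
        return 0;
    }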


Still doesn't mean that it has to store the strings in UTF-16. And a bit of common-case optimizing would allow ignoring that case until a string actually ends up with a non-BMP character in it.

Alternatively, if a JavaScript implementation chose to completely ignore that particular requirement, I'd guess that approximately zero pages (outside of test cases) would break, and a few currently broken pages (that assumed sane Unicode handling) would start working.


This is horrifying. :-(


If stuff like length and indexing depend on the actual underlying representation, the implementation is broken regardless of the actual format used.


Change "implementation" to spec and I agree with you.

An implementation that fails to meet the spec can be argued as not broken, even if said spec is broken.


I agree with you in concept, but the Web is actually not strictly tied to javascript. In the short term (say, next decade), JS is probably not going anywhere, but in the longer term I hope someone creates a more well-thought-out replacement. (And for the record, I kind of like javascript, just a few things I would change with it.)


Yes, like a proper VM that we can compile our preferred language to. It is ridiculous that the "language of the web" is something that is forced upon us instead of chosen. The number of languages that compile to JS illustrate the point nicely.


The likelihood of Javascript becoming non-dominant in the next 20 years is pretty low, I think. Rather than replacing it wholesale, you'll simply see it turn into something better over time.


A recent discussion about this involving Brendan Eich and a few other JS folk: https://gist.github.com/1850768


> converting to UTF-8 every time text needed to go through the layout engine would kill us in benchmarks

So what? Is your goal to create useful software, or win at worthless benchmarks?


Well, we're talking about DOM manipulation performance here. Pages that use DOM manipulation heavily will see a potentially-unacceptable performance loss if text always has to be converted to UTF-8.

Is fast DOM manipulation important? Given that the only way for the sole scripting language on the Web to display anything or interact with the user is through DOM manipulation, I think it's worth optimizing every cycle...


http://www.utf8everywhere.org/#faq.cvt.perf

If the function you're calling with UTF8 is non-trivial, converting a few dozen bytes is unlikely to make a significant difference. Benchmark it, of course, but don't be surprised if you don't need to care. Modifying the DOM is probably going to be non-trivial.


UTF-8 -> UTF-16 conversion was at one point a noticeable fraction of Firefox's startup time.

https://bugzilla.mozilla.org/show_bug.cgi?id=506431

Since then we've done things such as fast-path ASCII -> UTF-16 conversion with SSE2 instructions. Converting a few dozen bytes is unlikely to make a significant difference, but often one needs to deal with more than a few dozen bytes.


I believe the authors made it clear enough that they don't rule out UTF-16 completely:

> We believe that all other encodings of Unicode (or text, in general) belong to rare edge-cases of optimization and should be avoided by mainstream users.

So if you're writing a browser that must use UTF-16 in its Javascript engine due to dumb standards... it is a reasonable performance optimization to use UTF-16 for your strings. But how many people write Javascript engines?


It's not the conversion that kills you, it's the memory allocation. If you're trying to interact with a "chatty" UTF-16 API, the difference between using UTF-8 and UTF-16 in your implementation could be the difference between passing a pointer (a cycle or two) and allocating/deallocating a buffer (potentially hundreds of cycles) per call.

There are obviously ways around this. The API could be rewritten to exchange strings less frequently or use static strings that could be replaced with handles. You can try to be clever about your buffer allocation and share one amongst all calls (but watch out for threading issues!) You could write your own allocator. But all this plumbing just increases complexity and the risk of bugs, along with adding its own performance cost.

I'm not arguing against UTF-8 as the preferred encoding for many future applications, but the "minimal overhead" example given in the manifesto isn't particularly convincing.


I agree in principle. But "chatty" APIs usually work with short strings, in which case you can use stack allocations in those performance critical calls.


See my comment above; I can construct cases that make this sort of conversion have unacceptable overhead.

Would it matter in a real-world setting? I can't say for sure, because nobody I know of has tried making a production-quality UTF-8 web layout engine. But, in my mind, none of the benefits of UTF-8 (memory usage being the main one in a browser [1]) outweigh the performance risks of doing conversion. And the risk is real.

[1]: Note that you still need UTF-16 anyway, for interoperability with JavaScript. So using UTF-8 might even lead to worse memory usage, due to the necessity of duplicating strings, than a careful UTF-16-everywhere scheme that takes advantage of string buffer sharing between the JS heap and the layout engine heap would.


Is string storage really that big a portion of the browser's memory usage? I find it hard to believe that my browser is currently storing nearly a gigabyte of text. I'm not saying that there wouldn't be performance overhead, but would it be significant? I would be surprised if it made a big difference.

For legacy reasons, there probably isn't a point in changing it. But I'd be surprised if performance reasons turned out to be gating.


Claims like this are exactly the reason why we will still be stuck with multiple useless encodings in 2030.

I wonder how long it will take until people find their balls and decide to move towards the right direction.


Personally, I prefer UTF-8 as well. However, I think this whole debate about choice of encoding gets blown out of proportion.

Consider the following diagram:

                               [user-perceived characters] <-+
                                            ^                |
                                            |                |
                                            v                |
                  [characters] <-> [grapheme clusters]       |
                       ^                    ^                |
                       |                    |                |
                       v                    v                |
      [bytes] <-> [codepoints]           [glyphs] <----------+
Choice of encoding only affects the conversion from bytes to codepoints, which is pretty straightforward: the subtleties lie elsewhere...


Disagree

"UTF-16 is the worst of both worlds—variable length and too wide"

Really, the author tries to convince the reader, but it's not that clear-cut.

One of the advantages of UTF-16 is knowing right away it's UTF-16 as opposed to deciding if it's UTF-8/ASCII/other encoding. Sure, for transmission it's a waste of space (still, text for today's computer capabilities is a non issue even if using UTF-32)

"It's not fixed width" But for most text, it is. Sure, you can do UTF-32 and it may not be a bad idea (today)

Yes, Windows has to deal with several complications and with backwards compatibility, so it's a bag of hurt. Still, they went the right way (internally, it's unicode, period.)

"in plain Windows edit control (until Vista), it takes two backspaces to delete a character which takes 4 bytes in UTF-16"

If I'm not mistaken this is by design. The 4-byte character is usually typed as a combination of characters, so if you want to change the last part of the combination you just type one backspace.


> One of the advantages of UTF-16 is knowing right away it's UTF-16 as opposed to deciding if it's UTF-8/ASCII/other encoding. Sure, for transmission it's a waste of space (still, text for today's computer capabilities is a non issue even if using UTF-32)

First of all, if you don't know the encoding, then you don't know the encoding, and you will need to figure out if it's UTF-8, UTF-16, ISO-8859-1, etc. If you happen to know that it's UTF-16, you still need to figure out if it's UTF-16BE or LE.

> "It's not fixed width" But for most text, it is.

This is a dangerous way of thinking. One of the big problems with UTF-16 is that for most text, it is fixed width; so many people make that assumption, and you never notice the problem until someone tries to use an obscure script or an emoji character. This means that bugs can easily be lurking under the surface; while with UTF-8, anything besides straight ASCII will break if you assume fixed width, making it much more obvious.

> Sure, you can do UTF-32 and it may not be a bad idea (today)

UTF-32 isn't really meaningfully fixed width either. Sure, each code point is represented in a fixed number of bytes, but code points are not necessarily the interesting unit you want to index by. A glyph could be composed of several code points. Most of the time, you actually want to deal with text in longer units such as words or tokens, which are going to be variable width anyhow. The actual width of individual code points is only really of interest to low-level text processing libraries, not most applications.


"First of all, if you don't know the encoding, then you don't know the encoding"

True. But as you said, you have to know if it's BE or LE on UTF16. And there are ways to determine that automatically (or it's on the same platform, so it doesn't matter). With "ASCII compatible" encodings, you can't.

I guess the main issue to me is that UTF-16 is not "ASCII compatible" so you know it's a different beast altogether.

And don't worry, I'm not assuming UTF-16 is fixed width. One should use the libraries and not try to solve this 'manually'.

About UTF-32, think: CPU registers and operations. Working with bytes is inefficient (even with the benefit of smaller size).


> True. But as you said, you have to know if it's BE or LE on UTF16.

Yes, with UTF-16 you need to know not just the encoding, but also the endianness. That makes UTF-16 worse, not better.

UTF-16 is really the worst of all possible worlds: tons of wasted space, all the complexity of a variable-width encoding, and no fixed endianness.


> "It's not fixed width" But for most text, it is.

What does this mean/prove? I certainly hope you don't intend this to mean "so just pretend in your code that it will always be fixed width". And if that's not what you mean, then I don't know what you gain from that statement.


"One of the advantages of UTF-16 is knowing right away it's UTF-16 as opposed to deciding if it's UTF-8/ASCII/other encoding."

It is actually not that simple. By using UTF-16 you already have at least two problems:

1. You need to know whether the byte order is big-endian or little-endian.

2. You need to know whether your API supports the whole Unicode set or only 65536 symbols, e.g. the Windows API. Do you know the answer? What will happen if a user wants to abuse your system by using symbols outside those 65536?
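On point 1, the usual answer is either out-of-band metadata or a BOM check; a sketch of the latter, assuming the data really is UTF-16 (illustrative only):

    #include <stddef.h>

    enum utf16_order { UTF16_BE, UTF16_LE, UTF16_UNKNOWN };

    /* Sketch: U+FEFF serialized big-endian is FE FF, little-endian is FF FE. */
    enum utf16_order utf16_detect_order(const unsigned char *buf, size_t len)
    {
        if (len >= 2) {
            if (buf[0] == 0xFE && buf[1] == 0xFF) return UTF16_BE;
            if (buf[0] == 0xFF && buf[1] == 0xFE) return UTF16_LE;
        }
        return UTF16_UNKNOWN;    /* no BOM: context or guessing, nothing else */
    }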


I think the author's basic point is that if we standardize on utf-8, that "8 bit anxiety" goes away.

I did my first programming assignment on punched cards, so I probably have permanent ASCII/EBCDIC brain damage. However, this article decisively convinces even me that utf-8 wins and other encodings represent fail.


still, text for today's computer capabilities is a non issue even if using UTF-32

That obviously depends entirely on what kind of application we're talking about. Keeping large amounts of text data in memory as efficiently as possible is one of my greatest concerns. Many people are processing lots of text nowadays, more than ever before.

"It's not fixed width" But for most text, it is.

True, so ignoring it means that your code will be correct ... most of the time.


"Keeping large amounts of text data in memory as efficiently as possible is one of my greatest concerns"

Depends on what you consider efficiency. If it's size, sure, store it using UTF-8. But if you're worried about speed, then UTF-16 or 32 may be the way to go, since you're dealing with data that fits a CPU register. For example, on ARM comparing one byte is much more work than comparing one 32-bit value.

"True, so ignoring it means that your code will be correct ... most of the time."

No, I'm not going to ignore it! But in UTF-16 more code points match their UTF-16 encoding (which is easier for debugging).


Markus Kuhn's web page has a lot of useful UTF-8 info and valuable links (e.g. samples of UTF-8 corner cases that people often miss).

http://www.cl.cam.ac.uk/~mgk25/unicode.html


This is a great resource; it was extremely useful when I was writing a UTF-8 library myself. I found the UTF-8 stress test file particularly useful to run tests against: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt


Hmm, TextMate has problems with "5.2 Paired UTF-16 surrogates" in that stress test file.

(Yes, I interpreted the file as UTF-8 in TextMate).


Totally agree re: UTF-8 vs other Unicode encodings.

But are there still hold-outs who don't like Unicode? Last I heard some CJK users were unhappy about Han unification: http://en.wikipedia.org/wiki/Han_unification


I am a Korean user (the K in CJK), and no one, I repeat, no one, cares about Han unification here.

I heard that it is different in China and Japan though.


Probably because modern Korean text is Hangul, which is not really derived from the Han characters Chinese and Japanese have in common.

http://en.wikipedia.org/wiki/Hangul

http://en.wikipedia.org/wiki/Chinese_characters


Hanja is widely used in modern Korea.

http://en.wikipedia.org/wiki/Hanja


The main problem is that it means sort-by-unicode-codepoint puts things in a ridiculous order in Japanese/Korean. I kind of wish UTF-8 had the Latin alphabet in a silly order, so that western programmers would realise they need to use a locale-aware sort when sorting strings for display.


This is false. UTF-8 sorts Korean almost correctly. For practical purposes, you can use sort-by-unicode-codepoint to sort Korean.


I spoke with several Japanese people who said that some valid characters are not representable in Unicode.

That means that it's not just a technical problem (expensive sort routines or inefficient encodings) -- it's a semantic problem.


The way I've heard it explained, there are some historical alternate versions of some characters (a Latin-alphabet equivalent might be the way we sometimes draw "a" with an extra curl across the top, and sometimes without) that have the exact same semantic meaning, and so they were 'unified' to a single code point. Unfortunately, some people spell their names exclusively with one variant or the other, and Han unification makes that impossible in Unicode.


isn't the real problem that you can't guarantee correct rendering of ideograph text without specifying fonts? there are japanese kanji that are drawn differently from the chinese hanzi they're descended from, but they're the same from a unicode perspective.

imagine if roman, greek, cyrillic, hebrew (aramaic), and ethiopian (ge'ez) were all assigned to the same group of code points and distinguishable only by font--they're all just variants of phoenician, after all....


Do you think that sort-by-unicode-codepoint is good enough to use for technical contexts where most content is english or at least represented by the Latin alphabet? For example, do you think it's a valid choice to sort by codepoint for Java symbol names in a code refactoring tool?

I ask because I expect that sort-by-codepoint is an order of magnitude more efficient.


It's not a valid choice for anything that actually uses unicode. E.g. if I have functions caféHide() and caféShow() I expect them to be next to each other. I think Java should perhaps have required symbol names to be ASCII, but it doesn't and Java tools should deal with this.


Sort-by-unicode-codepoint does not work well in most western languages either. Among the languages with a Latin script, English is almost the only one where it works well enough to be usable.

For example the sort order is broken for all Nordic languages.


For those who don’t know it, UTF8-CPP[1] is a good lightweight header-only library for UTF conversions, mostly STL-compatible.

[1] http://utfcpp.sourceforge.net/


That collection of best practices can hardly be considered a "UTF-8 Everywhere Manifesto", as it focuses on Windows and C++. It's good, but I'd rather see a more manifesto-like document covering all cases at a domain like that.


I suspect this is mainly because Windows C++ programmers are the largest group that they feel need convincing. Which isn't totally their fault, Microsoft haven't done well by them by not offering good support for UTF-8; you can convert to/from it using WideCharToMultiByte but that's pretty low level, and higher-level APIs like CString will cheerfully munge UTF-8 strings for you. They also tend to conflate Unicode and UTF-16 which again doesn't help less experienced programmers realise that there might be alternatives.

I've been through the Windows Unicode stuff at a previous job, which ended up using mostly UTF-16 with some UTF-8 for interfacing to third party libraries and for files which needed to be backward compatible to ASCII (plus significant space savings, which I fought hard for). I think I prefer that approach though, since after the (difficult) conversion you didn't need to worry about encodings in 99% of the code. By their rules you'd gain significant complexity by transforming all over the place in any non-trivial GUI code.


Text is maddening, the modern Tower of Babel.

Is there a definitive reference, or small handful of references, to learn all that's worth knowing about text, from ASCII to UTF-∞ and beyond?


Joel Spolsky's 'The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)' is a good start: http://www.joelonsoftware.com/articles/Unicode.html

Like a few other specialised fields (cryptography comes to mind) the key takeaway is to use a library and rely on the work of people who know it better than you do and have handled all the subtleties already :)


I use UTF-8 for transmitted data and disk I/O, and I use UCS-4 (wchar_t on Linux/FreeBSD) for internal representation of strings in my software.

I generally agree with this article, but I disagree with it on the point that UTF-8 is the only appropriate encoding for strings stored in memory, and also I disagree on the point wchar_t should be removed from C++ standard or made sizeof 1, as in Android NDK.

Let me explain why.

In UTF-8, a single Unicode character may be encoded in multiple ways. For example, NUL (U+0000) can be encoded as 00 or as C0 80. The second encoding is illegal because it's longer than necessary and forbidden by the standard, but a naive parser may extract NUL out of it. If UTF-8 input is not properly sanitized, or there is a bug in the charset converter, this may result in an exploit like SQL injection or arbitrary filesystem access or something like that: a malicious party can encode not only NUL, but ", /, \ etc. this way.
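To make the overlong case concrete, here is the kind of minimum-value check a careful decoder has to apply to two-byte sequences (an illustrative sketch, not code from any particular converter):

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch: a two-byte sequence 110xxxxx 10yyyyyy is only legal if the
       decoded value is at least 0x80; anything smaller (such as C0 80,
       which decodes to U+0000) has a shorter encoding and must be rejected. */
    bool valid_two_byte_sequence(uint8_t b0, uint8_t b1)
    {
        if ((b0 & 0xE0) != 0xC0 || (b1 & 0xC0) != 0x80)
            return false;    /* not a two-byte sequence at all */
        uint32_t cp = ((uint32_t)(b0 & 0x1F) << 6) | (b1 & 0x3F);
        return cp >= 0x80;
    }

The same principle applies to three- and four-byte sequences, each with its own minimum value (0x800 and 0x10000 respectively).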

Also, a UTF-8 string can't be cut at an arbitrary position. Byte groups (UTF-8 runes) must be processed as a whole, so they end up either entirely on the left side or entirely on the right side of the cut.

Reversing of UTF-8 string is tricky, especially when illegal character sequences are present in input string and corresponding code points (U+FFFD) must be preserved in output string.

I think UTF-8 for network transmitted data and disk I/O is inevitable, but our software should keep all in-memory strings in UCS-4 only, and take adequate security precautions in all places where conversion between UTF-8 and UCS-4 happens.

And sizeof(wchar_t)==4 in the GCC ABI is not a design defect; wchar_t exists for a good reason. I admit that sizeof(wchar_t)==2 on Windows is utterly broken.


Concerning "cut at an arbitrary position" actually utf-8 is the only codec that can deterministically continue a broken stream because bytes that start a character are special.


> Also, a UTF-8 string can't be cut at an arbitrary position.

Neither can any other kind of Unicode string, because of combining characters. That's why the Unicode standard (or an Annex) recommends algorithms for text segmentation.

(And if you really need to cut at a certain length, you can easily backtrack to the beginning of the sequence by skipping over continuation bytes, i.e. bytes matching the bit pattern 10xxxxxx.)
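A sketch of that backtracking, assuming the buffer is otherwise well-formed UTF-8 (the function name is made up for illustration):

    #include <stddef.h>

    /* Sketch: move a cut position backwards (at most 3 bytes in valid UTF-8)
       to the start of the code point it falls inside, i.e. past any 10xxxxxx
       continuation bytes. pos must index into the buffer (pos < length). */
    size_t utf8_align_to_char_start(const unsigned char *buf, size_t pos)
    {
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            pos--;
        return pos;
    }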


The strangest thing about Unicode (any flavor) is that NULL, aka \0, aka "all zeros" is a valid character.

If you claim to support Unicode, you have to support NULL characters; otherwise, you support a subset.

I find most OS utilities that "accept" Unicode fail to accept the NULL character.

FWIW, UTF-8 has a few invalid characters (characters that can never appear in a valid UTF-8 string). Any one of them could be used as an "end of string" terminator if so desired, for situations where the string length is not known up front.

We could even standardize which one (hint hint). I suggest -1 (all 1s).

UPDATE: I meant "strange" as in "surprising", especially for those coming from a C background, like me.


No. NUL is backwards-compatible with ASCII, and is used everywhere. Choosing some arbitrary invalid UTF-8 byte for use as a terminator would be a terrible decision. If you want to handle NUL, simply use length-annotated slices instead of C-style NUL-terminated strings. Anything else is completely wrong.


Did you even read my comment? NULL is a valid UTF-8 character.

If specific languages and their standard libraries choose to treat it as a string terminator (C, I'm looking at you), well, then fine.

But it's still valid Unicode. If you claim to support Unicode strings, but don't support the NULL character in those strings, you don't support Unicode strings in their entirety.

Going further, there's nothing in the ASCII spec that requires NULL to only appear at the end of a valid string. That's a C language convention, AFAIK (maybe it started earlier...).


I think what you are trying to say is:

"because UTF-8 has invalid character sequences, we could potentially use one of them to represent end-of-string, which would allow us the flexibility of a null-terminated string (not keeping track of the length) without the restriction of no-nulls-allowed."

You're right! Great. But you are not revealing a "strange thing" about Unicode. You are instead making a general comment about null-terminated strings. So why use such inflammatory and misleading language like "If you claim to support Unicode, you have to support NULL characters"?

Update: I don't object to your idea at all, it's a neat trick! It's just that the way it's phrased, it sounds like Unicode's design contributed to this NULL-terminal problem, when in fact even NULL-terminated ASCII strings cannot 'handle' a null character in this sense.

To augment your idea, though, how about you use '0xFF 0x00' as a terminator? This way, backward-compatibility is preserved in all cases except UTF-8 => ASCII with NULLs, and in this case the string will be truncated rather than a buffer overflow (i.e. "fail closed").


@bobbydavid Thanks for re-stating what I'm saying.

Re: "So why use such inflammatory and misleading language like "If you claim to support Unicode, you have to support NULL characters"?"

I'm not trying to be "inflammatory" or "misleading". IMO (note: opinion), if a given API claims to support Unicode strings, but instead disallows certain Unicode characters from appearing in those strings, then, IMO, that API only partially supports Unicode strings.

Others could have different opinions (evidently, you do).

I don't know if handling all Unicode characters should be a requirement for an implementation to call itself "Unicode-compliant", but barring another standard, that seems reasonable to me.

A better option IMO would be to have never included NULL in Unicode in the first place, since it is so widely used in system software as an end-of-string terminator. But that ship has sailed...


UPDATE: Yes, I read yours. I suggested a non-NULL terminator for strings whose length is not known up front. Your alternative only applies to length-annotated strings.

You manage this by having the sending end slice things into lengths for the receiver. Is there some globally-recognized standard for doing so that I'm not aware of?

Because if there isn't, my termination proposal (an invalid Unicode byte) is just as valid as your framing protocol.


I'm sorry, did you read mine? "If you want to handle NUL, simply use length-annotated slices instead of C-style NUL-terminated strings."

Seriously though, using 0xFF as a UTF-8 string terminator would be a terrible mistake.


-1 is not a valid Unicode code point. "All 1s" is not adequately defined without saying how many 1s – and Unicode does not specify a maximum bit width. Even if you said "the maximum Unicode code point", that is not all 1s – it is 0x10FFFF.


That's the entire point of choosing -1 as an "end of sequence" marker for a UTF-8 string when the length is not known up front.

A byte containing all 1s (0xFF) can never appear in a valid UTF-8 string, so if one appears, you'd know you had hit the end of the string.


This doesn't work well when you have to handle ill-formed sequences.

The length is of course known up front, if not for the whole string then at least for each individual small substring.


OK, I thought you meant a code point containing all 1s. Thanks for clearing that up.


> The strangest thing about Unicode (any flavor) is that NULL, aka \0, aka "all zeros" is a valid character.

This is either false or misleading, depending on what you're talking about.

In UTF-8, '\0' is all-bits-zero, one byte, and means the same thing as it does in ASCII. It cannot occur in the encoding of any other character.

In UTF-16, the byte 0x00 may validly occur within the encoding of a character that is not '\0'. The same is true for UCS-4.

This is a big reason UTF-8 is as popular as it is.
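
A small Python illustration of the difference (my own addition, not the parent's):

  # 'A' (U+0041) contains no zero byte in UTF-8, but it does in UTF-16
  # and UCS-4/UTF-32 -- which is what breaks NUL-terminated C APIs.
  print("A".encode("utf-8"))      # b'A'
  print("A".encode("utf-16-le"))  # b'A\x00'
  print("A".encode("utf-32-le"))  # b'A\x00\x00\x00'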


Can someone explain to me how UTF-8 is endianness independent? I don't mean that I am arguing the fact, I just don't understand how it is possible. Don't you have to know which order to interpret the bits in each byte? And isn't that endianness?


No, that’s not endianness; endianness refers to the ordering of bytes within a multi-byte value—least significant byte first or most significant byte first, generally. The order of octets in a UTF-8 code point is fixed, and because a bit is not an addressable unit of memory, the storage order of bits within an octet is immaterial.


It's endianness independent in the sense that the order in which you interpret the bytes in each character does not depend on the processor architecture, unlike UTF-16.

If your processor interprets the bits in each byte in a different order, that might be a problem, but it's not what we're talking about when we usually talk about the endianness of character encodings.

http://en.wikipedia.org/wiki/Endianness


Thank you. That is very good to learn and I looked over the wikipedia article. But as far as byte order, how is that architecture independent? Is it just that utf-8 dictates that the order of the bytes always be the same, so whatever system you're on, you ignore its norm, and interpret bytes in the order utf-8 tells you to?


Yes, basically. UTF-8 doesn't encode a code point as a single integer; it encodes it as a sequence of bytes, with a particular order, where some of the bits are used to represent the code point, and some of them are just used to represent whether you are looking at an initial byte or a continuation byte.

I'd recommend checking out the description of UTF-8 on Wikipedia. The tables make it fairly clear how the encoding works: http://en.wikipedia.org/wiki/UTF-8#Description
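
If it helps, here is a rough Python sketch of the bit-packing (mine, simplified: it does no validation of surrogates, over-long forms, or out-of-range values):

  def utf8_encode(cp):
      # The lead byte's high bits announce the sequence length;
      # continuation bytes always match the pattern 10xxxxxx.
      if cp < 0x80:
          return bytes([cp])
      if cp < 0x800:
          return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
      if cp < 0x10000:
          return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
      return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,
                    0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

  assert utf8_encode(0x00E9) == "\u00e9".encode("utf-8")       # é, 2 bytes
  assert utf8_encode(0x1F600) == "\U0001F600".encode("utf-8")  # emoji, 4 bytes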


utf-8 is a single byte encoding. Reversing the order of a sequence that's one byte long just gives back that one byte.


No it isn't. Any letter with an accent will take up two bytes. Most non-Latin characters take up three bytes, sometimes even four.


Poorly phrased. It can take multiple bytes to fully define one codepoint, but the encoding is defined in terms of a stream of single bytes. In other words, each unit is one byte, hence flipping each unit gives back the same unit.

This is not the case for UTF-16 and UTF-32.


AFAIK the correct term is "byte oriented".


Yes but those two bytes will be in the same order regardless of the endianness of the system.


No. You never need to know how to interpret the bits in each byte; you cannot address individual bits. Endianness refers to how different bytes within a multi-byte value are addressed; little endian means that the smaller addresses refer to the lower order bytes, big endian mean that smaller addresses refer to the higher order bytes.

Think of it as which order you write the digits in a number. Each digit (byte) means the same thing regardless of whether you are big-endian or little-endian. But in big endian, you would write one thousand two hundred thirty four as 1234, while in little endian you would write it 4321.
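
For example (Python, my own illustration), the same character serialized three ways:

  s = "\u6f22"                  # 漢, U+6F22
  print(s.encode("utf-8"))      # b'\xe6\xbc\xa2' -- identical on every machine
  print(s.encode("utf-16-le"))  # b'"o'  (0x22 0x6F)
  print(s.encode("utf-16-be"))  # b'o"'  (0x6F 0x22) -- a byte order had to be chosen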


Cannot agree more! It would be a much better world if we all used UTF-8 for external string representation. I don't care what your app uses internally, but if it generates output, please use UTF-8.


Is there a simple set of rules for people who currently have code which uses ASCII, to check for UTF-8 cleanness?

In particular, what should I watch out for to make an ASCII parser UTF-8 clean?


If you're reading something in pieces, like a buffer that fills 256 bytes at a time, you have to be careful. UTF-8 is a multi-byte encoding so the last byte in your buffer may not completely finish a code point. Unlike older code that can just read a bunch of bytes and use them, with multi-byte encodings you have to have a way to deal with "left-overs" until new bytes show up.

Fortunately the UTF-8 encoding (e.g. see the Wikipedia page) makes it clear when a byte is the beginning of a new point and it tells you how many intermediate bytes should follow.
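
In Python, for example, the standard library's incremental decoder handles the left-overs for you (a small sketch of mine):

  import codecs

  decoder = codecs.getincrementaldecoder("utf-8")()
  data = "日本語".encode("utf-8")          # 9 bytes, 3 per character
  for i in range(0, len(data), 4):        # pretend we read 4 bytes at a time
      print(repr(decoder.decode(data[i:i + 4])))  # partial trailing bytes are buffered
  print(repr(decoder.decode(b"", final=True)))    # flush; raises if bytes were left over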


UTF-8 is easy to figure out. However, it's only a doorway to Unicode, and Unicode is not simple.

If you do anything more than chopping codepoints and passing them around — use a library. There's a lot of technical complexity due to quirks of Unicode and inherent complexity of world's diverse writing systems.

Avoid using the concept of a "character" as much as you can, as it's fuzzy; e.g. there are combining characters and ligatures (codepoint != character).

Be aware that string comparison cannot be done just by comparing codepoints, and there are different levels of "sameness" of Unicode strings coming from different normalisations, e.g. NFC and NFKD.

Case-insensitive comparison cannot be done by lowercasing a string: http://www.moserware.com/2008/02/does-your-code-pass-turkey-...

Unfortunately I don't know much about RTL text, and there's lots of traps there too (e.g. there are control characters for controlling text direction).
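
A couple of these traps, illustrated in Python with the standard unicodedata module (my own examples; note that str.lower()/casefold() are not locale-aware, so the Turkish case still needs a proper i18n library):

  import unicodedata

  a = "\u00e9"    # é as one precomposed code point
  b = "e\u0301"   # é as 'e' plus a combining acute accent
  print(a == b)                                  # False
  print(unicodedata.normalize("NFC", a) ==
        unicodedata.normalize("NFC", b))         # True

  # Case-insensitive comparison needs case folding, not lowercasing:
  print("STRASSE".lower() == "straße")                  # False
  print("STRASSE".casefold() == "straße".casefold())    # True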


If you are "parsing" a string then you will have problems unless you specifically make the code deal with unicode code points and not bytes. If you just accept a char* and then pass it on with the contents as is you'll generally be fine (except on Windows).


All 7-bit ASCII is valid UTF-8, so you're fine as long as you're really using 7-bit ASCII and not latin1 or Windows-1252, etc.


How could we avoid acronyms like 'utf-8'?

We can do better than that. Unicode8?


Just use the term "string" to refer to utf-8, and the term "data in nonstandard encoding X" to refer to other encodings.

In the article he puts it in terms of std::string, but more generally I think this is what he means.


You're confusing things. Strings cannot be utf-8 any more than you can be your signature.

"strings" are abstract data structures. They are lists of characters. Not bytes, not integers, but characters. Often, we use the Unicode character set as the set of allowable characters. There are other character sets.

Internally, strings often represent characters as integers. When using the Unicode character set, strings then use the Unicode encoding to integers (a table mapping characters to unique numbers). Sometimes we use other character sets and encodings.

Unfortunately, integers are abstract. You can't store them in a file or transmit them over a network until you pick a concrete representation as bytes. How many bits per integer? Big or little endian? Etc. That's where UTF-8 comes into play.

UTF-8 is merely a compressed data format used to represent a sequence of integers as a sequence of bytes - a way that happens to have some properties that make it convenient for representing strings.

UTF-8 is not Unicode.

UTF-8 can also be used for other types of numerical data. As a silly example, suppose you had a list of ages of houses. Many houses are less than 100 years old. A few are more than 300 years old. An efficient serialization of that data would be to represent the ages as integers and then utf-8 encode your list of integers.
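
The house-age example, sketched in Python (purely illustrative; chr()/encode() only accepts values that are themselves valid code points, so the surrogate range 0xD800-0xDFFF and anything above 0x10FFFF would have to be avoided):

  ages = [42, 87, 315, 12, 1021]
  blob = "".join(chr(age) for age in ages).encode("utf-8")
  print(len(blob))    # 7 bytes: ages under 128 cost one byte each, larger ones two
  assert [ord(c) for c in blob.decode("utf-8")] == ages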

Some true statements: A character set is a set of characters. Characters are not integers or bytes. A mapping from characters to integers is an encoding. Unicode is a standard that defines a character set and an encoding of characters to integers. Mapping integers to bytes is confusingly also called encoding. UTF-8 is an encoding from integers to bytes. UTF-8 is not Unicode.


I'm just saying that if we standardize on an encoding, we don't need to talk about encoding. Which is my interpretation of the original document.

Separate point: There is no such thing as an abstract string or integer in a computer, no matter what language you are using. Every string in a computer has an encoding - you have to store it as ones and zeros.

If we standardize on UTF-8 as an encoding, we just don't need to use the awkward phrase "UTF-8" in ordinary conversation.


So we should call it "Unicode-8" to further confuse people about the difference between "Unicode" and "UTF-8"?


That seems kinda pointless. If somebody is comfortable with "HTML" or "C++", UTF-8 is downright friendly-sounding by comparison.


Why should we?


That page is misleading when it comes to Japanese text: UTF-8 sucks for Japanese text. UTF-8 and UTF-16 aren't the only two choices in the whole world, as their own inclusion of Shift-JIS in the comparison demonstrates.


Can you elaborate on that? Why does Unicode suck for Japanese text?


Not only kanji, but also hiragana and katakana (syllabic alphabets) encode to three bytes per character. Shift-JIS can encode all three to two bytes, as well as half-width katakana to one byte per character.

However, if size is such a concern (eg for web transmission), text compression neutralizes the perceived benefit of region-specific encodings.

Shift-JIS' continued popularity has much more to do with change aversion than it does technical merit.


As I said above, I spoke with several Japanese people who said that some valid characters are not representable in Unicode.

Some details can be found here: http://en.wikipedia.org/wiki/Han_unification


On the web, ASCII (think HTML tags, CSS stylesheets, etc.) typically makes up a large fraction of even CJK pages, so the relative inefficiency of UTF-8 for encoding them is less important.


@ruediger There's nothing wrong with Unicode. UTF-8 sucks because it ends up taking more space.

@byuu No it doesn't. Try compressing a SJIS text using gzip. Then convert it to UTF-8 and do the same thing. With a "perfect" compressor, there shouldn't be any difference since the information contents are the same, but unfortunately we don't have a perfect compression algorithm that hits the theoretical lower bound for compression.
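
Anyone curious can measure this themselves; a rough Python sketch (mine - the results will obviously depend on the text):

  import zlib

  text = "日本語のテキストをいろいろな符号化方式で比較する。" * 100
  for enc in ("shift_jis", "utf-8"):
      raw = text.encode(enc)
      print(enc, len(raw), "->", len(zlib.compress(raw, 9)))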


tl;dr: Use UTF-8 when you need to use Unicode with legacy APIs, never anywhere else.

UNIX isn't UTF-8 because UTF-8 is better, UNIX is UTF-8 because you can pass UTF-8 strings to functions that expect ASCII and it kinda works. This is really the only thing you need to know about UTF-8 and why it's better.

There are few pieces of software that don't have to talk to legacy APIs that store strings natively in UTF-8.

C# and Java are probably the best examples of software that was engineered from the ground up and thus uses UTF-16 internally, because it's much less likely to run into issues like String.length returning 32 yet only containing 31 characters. If you use UTF-8, expect this result any time a string contains a real, genuine apostrophe.

"UTF-8 and UTF-32 result the same order when sorted lexicographically. UTF-16 does not."

This is complete and utter bullshit: to sort a string lexicographically you need to decode it, and if you've decoded the string into UNICODE then they sort the exact same way.

There are lots of gotchas for sorting UNICODE strings including normalization because you can write the semantically equivalent strings in unicode multiple ways. eg. ligatures.

If you're sorting bit strings that happen to contain UTF-8/32 then you're not sorting lexicographically and your results will be screwed up anyway.


> decoded the string into UNICODE

I think you are quite confused.

1) Unicode is not an acronym.

2) You cannot "decode into Unicode". I think you mean "decode into codepoints".

3) If that is what you mean, then you are wrong about sorting: sorting UTF-8 and UTF-32 bytestrings will indeed sort them lexicographically by code point, which was the author's point. No, that will not generally be the sort you _want_; but no amount of 'decoding' will give you the sort you want. For that you need to first normalize, and then follow the collation rules, which don't sort by raw code points at all.


"it's much less likely to run into issues like String.length returning 32 yet only containing 31 characters"

This is exactly the problem with UTF-16. Most APIs that use it will have support for string operations that return the number of codepoints rather than the number of bytes, and as a result people think that it's a solved problem. But in fact you've only solved half the problem, because the number of codepoints is almost certainly not what you want - you want the number of characters, and the only way your library functions can know that is to know which Unicode codepoints are combining characters. And that set potentially gets larger with every new Unicode release.

In other words, if you're relying on languages that have native UTF-16 support to tell you the number of printable characters, your application is inevitably going to be broken the first time someone uses a newly-defined combining character. UTF-16 buys you absolutely nothing useful in this respect.

(Example: How many characters is "é"? "é"? "é"? Does UTF-16 give you a more useful answer to that question?)


> (Example: How many characters is "é"? "é"? "é"? Does UTF-16 give you a more useful answer to that question?)

In case anybody was wondering:

1. The first "é" is an "e" followed by a combining acute accent.

2. The second "é" is a single code point for a lowercase-e-with-acute.

3. The third "é" is the same as the second "é", but with a zero-width non-breaking space in front of it.

All of these are, of course, the same letter.
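
For anyone who wants to poke at these, here are the escapes in Python (my reading of the parent; I'm assuming the third form uses U+FEFF). Note that neither the code point count nor the UTF-16 code unit count gives "one character" for the first and third forms:

  forms = ["e\u0301", "\u00e9", "\ufeff\u00e9"]
  for s in forms:
      print(len(s),                           # code points:       2, 1, 2
            len(s.encode("utf-16-le")) // 2,  # UTF-16 code units: 2, 1, 2
            len(s.encode("utf-8")))           # UTF-8 bytes:       3, 2, 5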


UTF-8 is explicitly designed in such a way that Unix APIs don't have to care about it, and such that the lexicographic ordering of UTF-8 encoded byte streams is the same as the lexicographic ordering of Unicode code point vectors (which is arguably almost never what you care about when sorting text strings). Both of these features are the result of conscious design and not some random coincidence.

As for sorting text strings of any kind (which has mostly nothing to do with Unicode), you have to care about what the user expects, which depends on locale and sometimes even on the user's preference. Text-sorting algorithms for each separate locale are mostly non-trivial, but fortunately they can be generalized into one relatively straightforward algorithm (similar to Unicode normalization) involving a pretty large database and some special cases.

When you are sorting bit strings you are sorting lexicographically, but that is usually not what you should be doing.
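
A concrete case of the code-point-order property (my own example): U+FF5E is below U+10000 by code point, but in UTF-16 the latter is stored as a surrogate pair starting with 0xD8, which compares lower than 0xFF.

  a, b = "\uff5e", "\U00010000"   # U+FF5E < U+10000 by code point
  print(a.encode("utf-8") < b.encode("utf-8"))          # True  -- order preserved
  print(a.encode("utf-16-be") < b.encode("utf-16-be"))  # False -- surrogates sort low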


"When you are sorting bit strings you are sorting lexicographically, but that is usually not what you should be doing."

That's my point about it being bullshit, it sounds like a feature you might want, but honestly it's useful to very few people.


Man, the article doesn't try to argue for or against UTF-8 based on lexicographical order. It's in the "facts" section, and it's a correct fact. So it's just as neutral as "Widechar is 2 bytes [...] 4 on others." or "In both [...] characters may take up to 4 bytes".


> "UTF-8 and UTF-32 result the same order when sorted lexicographically. UTF-16 does not." This is complete and utter bullshit, to sort a string lexicographically you need to decode it, if you've decoded the string into UNICODE then they sort the exact same way.

One original purpose of Unicode, still mentioned in the published standard for v6.0, is to assist the 100-odd other encodings, not replace them. Each other encoding only needs a conversion process to and from Unicode, i.e. 200 conversion processes overall, instead of into every other encoding, i.e. 9900 processes overall. For sorting, Unicode is order-invariant. The text should be converted into a relevant country-specific encoding, sorted within that encoding, then converted back to Unicode.


Your tl;dr is misleading, doesn't represent the thrust of the article, cherry picks nits, and makes assertions that are contradicted with evidence in the article (e.g. UTF-16 is not fixed length).


Java does NOT implement UTF-16, nor do most systems that claim to.

It's UCS-2, which is its own even worse pile of bullshit.


> UNIX isn't UTF-8

These are the only three words that are correct in your comment. You're spreading a lot of FUD about UTF-8, and in many cases, you are completely incorrect about many things technical.

I hate to make this personal, but you really need to investigate what you're talking about before jumping on HN and talking shit about UTF-8. UTF-8 has some criticisms, but none of what you have written so far is even remotely valid. I hope people realize this instead of getting scared away, which is, interestingly, what the point of the manifesto linked above is all about.


ASCII and UTF-8 are too US centric. That's why adoption in places like China is so low.

Also, if there's variable-length encoding, why can't we just do it in a proper way and improve size for the same computational cost?


The author makes a compelling case for UTF-8 in Asian languages.

I'd love to hear any specific counter-arguments.


No he doesn't, he dismisses it out of hand by choosing an example that is uniquely suited to minimize the advantages of UTF-16 for non-Roman scripts. Precious little of an HTML document is actually textual content.


HTML documents are hardly unusual examples. Also, look at the other column in that table, where he stripped out the HTML tags and looked only at the body text: UTF-16 was somewhat smaller, and gzipping them made the difference negligible.

Does UTF-16 really have such a great advantage for non-Roman writing systems? Or is this motivated more by a disliking for Anglocentrism?


Precious little of the data stored in the world is textual content. It's hard to believe that choice of encoding standards materially impacts RAM or disk budgets.

Most of his points are not related to encoding size but to simplicity and standardization. I find those reasons to be very compelling.

I'm asking for clear counterarguments because I concede that my ASCII background could predispose me to UTF-8. Please be a little clearer, I really do want to know.


I think that's the point: in the real world, any significant chunk of non-Roman text is embedded in far more Roman text, or, in terms of the internals of programs, is generally dwarfed by the size of other data structures.

More or less, I'd say that storage size of text usually doesn't matter.


Did you read the article, including the part about Asian text? Like it or not, most text these days is embedded in markup languages like XML or HTML, in which all of the markup is within the ASCII range. This, coupled with the fact that UTF-8 gives you a factor of 2 savings over UTF-16 for the ASCII range, while only a factor of 1.5 increase over UTF-16 for CJK characters, means that for much text (such as anything on the Web), UTF-8 is actually smaller than UTF-16 even for CJK text.

Yes, ASCII is obviously too US centric; you can't encode any writing systems other than the Roman alphabet in ASCII. However, that's not at question here. The question is, which Unicode encoding should you use, so you can represent all writing systems in a single encoding. And the major contenders are UTF-8 and UTF-16. The point of this article is, for that purpose, UTF-8 is a far better choice.

> Also, if there's variable-length encoding, why can't we just do it in a proper way and improve size for the same computational cost?

What do you mean by a "proper way"? If size is what you care about, just compress your data. Compression will do a lot better for a much wider range of data than some clever encoding will. UTF-8 is a carefully constructed encoding designed to meet several design criteria. For instance, you could get better size for a wider range of character sets by having a single byte to represent switching between character sets; so you could use that byte, and then a whole bunch of 2 byte CJK characters. But that would defeat one of the design goals of UTF-8, which is to be self synchronizing. That means that if you get a partial sequence (such as a sequence that has been truncated), you can start decoding the characters after a fixed number of bytes. In the case of UTF-8, you will never have to go more than 3 bytes before you can start decoding again. In my hypothetical scheme where certain symbols were used to switch between character sets, you would not be able to interpret anything until you found the next such symbol. This makes UTF-8 more robust in the face of errors.
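
The self-synchronization property is easy to see in a few lines of Python (a sketch of mine): after an arbitrary truncation you only ever have to skip continuation bytes, of which there are at most three in a row.

  data = "自己同期する".encode("utf-8")
  damaged = data[2:]              # stream truncated in the middle of a character
  i = 0
  # Continuation bytes always match 10xxxxxx; skip until we find one that doesn't.
  while i < len(damaged) and (damaged[i] & 0xC0) == 0x80:
      i += 1
  print(i, damaged[i:].decode("utf-8"))   # 1 己同期する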

Another design goal of UTF-8 was to be backwards-compatible with ASCII. Like it or not, ASCII has been the standard encoding for decades, and there is a lot of text in ASCII and a lot of software that uses ASCII delimiters and the like.

So, while it would be possible, in theory, to define a character encoding that is more "fair" than UTF-8, that ignores many of the other goals of the design of UTF-8. And UTF-8 is widely supported and used (it is the most popular encoding on the Web, even in places like Japan, and a close second in China), while a new encoding would require another large, global, and painful transition process to introduce.


The author compares UTF-8 to UTF-16, while there are a myriad better encodings than both for different Asian languages.

For instance: EUC-JP in Japan, BIG5 in Taiwan, GB in China, etc. The various EUC encodings are variable-length and a lot more efficient, since they put each language's common subsets in the lower parts of the table, close to each other, so they also compress better, while allowing tricks for text matching and searches (not really necessary for web sites, but it's nice to use the same encoding throughout applications sometimes). Russian and Greek text is also basically doubled in size by UTF-8.

There are a lot of other considerations.

If you think a 30%+ saving in size (and latency) is not a big deal, then you're a lot more likely to lose to local competitors. Note that gzipping or otherwise compressing the text makes the differences even worse, at least in the case of Japanese, where UTF-8 text gets de-aligned all over the place to odd byte sizes and compresses worse. Add to that the fact that Asians browse the net A LOT from the phone, and have done so for much longer than westerners and in a bigger percentage, and you have your problem exacerbated even further.

There is a lot more to consider and like it or not it's not as simple as "UTF-8 for everything and everybody, ever!"


> Note that gzipped or otherwise compressed text makes differences even worse at least in the case of Japanese - where UTF-8 text gets de-aligned all over the place to odd byte sizes and compresses worse.

That was not what this article measured. UTF-8 and UTF-16 compressed to virtually the same size.


Have you ever worked with a system that needs to deal with more than one language at the same time? What if your users want to mix Japanese with Russian in the same sentence? Or Japanese and simplified Chinese? (Yes, people do that.)

In the global Internet, UTF-16 and UTF-8 are the only games in town.


All the damn time I'm using several languages.

Then UTF (and EUC's) are the way to go.

It's not like you have to use the same encoding all the time.


> It's not like you have to use the same encoding all the time.

Then you are going to feed someone garbage. Why feed people garbage?


??

Not if you know what you're doing. Not any more than using utf8 exclusively all the time and for all purposes.


> Not if you know what you're doing.

This is nice in theory. In practice, people make mistakes. Make it easy on yourself.

> Not any more than using utf8 exclusively all the time and for all purposes.

Maybe I was unclear: Feeding me Chinese text in UTF-8 is not garbage. Feeding me anything in one of the GB encodings is garbage.

Garbage, to me, is text in an encoding I can't handle. If you only use UTF-8, that cannot possibly happen.


> What do you mean by a "proper way"? If size is what you care about, just compress your data. Compression will do a lot better for a much wider range of data than some clever encoding will.

That's precisely what I meant. Simple variable-length compression.



The new HN: disagree = downvote


It's not really new.

Whether you agree with the sentiment or not, pg's oldish comment on the issue at least establishes the existence of the behavior several years ago, and has been taken by many to be the final word on the acceptability of the practice: http://news.ycombinator.com/item?id=117171

(edit: spelling)


Paired with increasing groupthink, HN discussions are getting boring.


Strings (NSString) on Apple platforms are UTF-16. The Apple platforms are not exactly lagging behind in either multilingual, or text processing. I wonder what this team of three people knows that Apple doesn't? Or is it the other way around, that Apple knows something they don't, and when it comes to shipping products that work in the real world, Apple has figured out how to do it?


As the authors explain in the post, many things use UTF-16 internally (Python, Java, C#, etc...). It does not mean that it's the best solution.

Have you read the article? They give a good explanation of why UTF-16 is "the worst of both worlds" (wide characters AND variable length).


NSStrings are opaque--you always call accessor functions and never have access to the low level backing store. The reason they are good is that you can't get data into or out of them without specifying an encoding, which leaves the actual encoding of the backing store as an implementation detail.

The fact is, I don't even know (or see documented) that the backing is UTF-16--Apple is free to change that at their whim and no user programs would break.


It's not documented (presumably) for that very reason.

In fact, the opposite is implied by initWithBytesNoCopy:length:encoding:freeWhenDone: - it should be possible right now to have NSStrings with arbitrary internal representations, even if most other creation methods currently convert to UTF16.


Unfortunately, this isn't entirely true.

  - (unichar)characterAtIndex:(NSUInteger)index
will return UTF-16 code units (whole code points or surrogate halves), not "characters"


NSString is decades old, from NeXTSTEP (you can see that in the name: "NS"). While it's possible they could change it, the in-memory representation doesn't matter much in this case. When you transfer data out, such as with writeToFile:encoding: or by converting it into a char * (often with UTF8String, or one of the C string methods), you are almost always specifying an encoding anyway. And, for most Cocoa apps, that encoding is UTF8.



