How To Use UTF-8 Throughout Your Web Stack (rentzsch.tumblr.com)
79 points by shawndumas on Aug 19, 2011 | 42 comments

He's missing the most important part: ensuring that your application code is treating the text as text, not as an octet stream. This varies by language, but typically the code is something like "text = decode('utf8', binary)" when your application first sees data from the wire (or files, or a URI string, etc.), and "binary = encode('utf8', text)" when the data leaves your program, like to a log file or the terminal or a socket.
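A minimal Python 3 sketch of that decode-on-entry, encode-on-exit pattern. The function names and the sample bytes are made up for illustration; in Python 3, `bytes` plays the role of "binary" and `str` the role of "text".

```python
def read_from_wire(raw: bytes) -> str:
    """Decode octets into text as soon as they enter the program."""
    return raw.decode("utf-8")  # raises UnicodeDecodeError on invalid input


def write_to_wire(text: str) -> bytes:
    """Encode text back into octets only when it leaves the program."""
    return text.encode("utf-8")


incoming = b"na\xc3\xafve caf\xc3\xa9"  # hypothetical bytes read from a socket
text = read_from_wire(incoming)         # a str of 10 characters
shouted = text.upper()                  # text operations see characters, not octets
outgoing = write_to_wire(shouted)       # octets again, ready for the socket
```

Everything between the two calls works on characters, so `len(text)` counts 10 characters even though 12 octets arrived.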

I say "binary" and "text" because the Internet cannot transmit text, it can only transmit "binary" octet streams. (Similarly, UNIX files can only store octets, and UNIX file names can only store octets other than / and NUL.) But, your programming language supports both text manipulation and binary manipulation, so you have to tell it how you want to treat the data. Each language is different; Perl treats everything as Latin-1 text by default (which happens to work nicely for binary, as well, but not so nicely for UTF-8-encoded text).

Often, libraries will handle this for you, since they have access to out-of-band information. If your locale is en_US.UTF-8, filenames can be assumed to be UTF-8-encoded. If the HTTP response's content-type says "charset=utf-8", your HTTP library will know to decode the octet stream into text for you. But it's important that you both test this and find the code that does it for you, because sometimes library authors forget or libraries have bugs, and one bug will ruin your whole operation.
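For illustration, here is roughly what such a library does with the Content-Type header, sketched with Python's standard `email.message` module. The two helper names are hypothetical, and refusing to decode when no charset is declared is my own choice here, not something every HTTP library does.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type value, lowercased."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(default)


def decode_body(body: bytes, content_type: str) -> str:
    """Decode a response body using only the out-of-band charset."""
    charset = charset_from_content_type(content_type)
    if charset is None:
        # No encoding was declared: refuse rather than guess.
        raise ValueError("response did not declare a charset")
    return body.decode(charset)
```

Calling `decode_body(raw, "text/html; charset=utf-8")` gives you text; a bare `"text/html"` raises instead of silently picking something.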

Handling Unicode text is hard because it's a rare case where you have to get everything right or the results of your program will be undefined. And, there are no "reasonable defaults", so you have to be explicit about everything. Finally, you can't guess about what encoding your data is; all binary data must come with an encoding out-of-band, or your program will break horribly. Proper text manipulation is the ultimate test of "can I write correct software", and it isn't easy.
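A small Python sketch of why guessing is dangerous: the same two octets decode without any error under two different encodings, and only one result is what the sender meant. The sample character is arbitrary.

```python
octets = "\u00e9".encode("utf-8")     # 'é' encoded as UTF-8: b'\xc3\xa9'

as_utf8 = octets.decode("utf-8")      # one character, U+00E9
as_latin1 = octets.decode("latin-1")  # two characters, U+00C3 U+00A9, no error

# Latin-1 maps every possible octet to a character, so a wrong guess
# never fails; it just produces different text.
print(repr(as_utf8), repr(as_latin1))
```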

I agree with most of your points, but disagree that guessing the encoding should never be done. I think that conflicts with the basic robustness principle: "be conservative in what you do, be liberal in what you accept from others".

I personally think being liberal in what you accept from others is the second worst evil in computer science. The worst being null, of course.

I agree. It allows sloppy developers to be liberal in what they do, and leads to increasingly complex (and incompatible) implementations necessary to be compatible with all the edge cases.

HTML is a good example. Browsers are very tolerant of malformed HTML, which is nice for beginners who don't want to worry too much about perfect syntax.

The problem is each browser handles the unspecified cases differently, which leads to differences in the way pages are rendered, security issues like XSS, etc.

Robustness should just be built into the protocol/format/spec, if necessary. HTML5 gets this right by specifying an algorithm that all parsers should use to get consistent behavior, while still being tolerant of imperfect syntax: http://en.wikipedia.org/wiki/Tag_soup#HTML5

Hey now. If software started validating its input, what would virus writers do for a living?

Then you will also personally produce programs which would be broken for 5/6ths of the world population who happen to use letters outside latin1.

There's no way to avoid it unless you wrap it up and add some explicit checks and guesses.

Won't all modern browsers include the encoding in the Content-Type header?

They should. If so there's no need to guess.

It's not just browsers. Browsers are pretty sane when it comes to charsets, because they had the time to get it right and the pressure to do so. (It wasn't like that in the days of NN4/IE4, which would interpret your text as whatever they wanted and wouldn't even let you override it.)

Faced with something less adamant (like the dreaded ID3 tags), no such luck.

This is probably a good time to re-link to "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets" by Joel.


This post got me wondering: what would a character set not requiring backward compatibility with ASCII look like? I tried to put the question to the StackOverflow folks, but it was apparently off topic:

Thoughts on a non-ASCII encumbered character set? What can be improved over UTF-8?

The obvious 'native' format for Unicode data is UTF-32, which forgoes ASCII-compatibility and gets you regular, consistent sizing. Unfortunately, Unicode itself is an ASCII-encumbered character set (or at least Latin-1-encumbered, and Latin-1 is ASCII-encumbered). Pretty much all the contents of Unicode are there because somebody needed them¹, so I imagine a non-ASCII-encumbered Universal Character Set would still contain all the ASCII characters... somewhere... but they'd likely be shuffled around a bit from the Unicode we know and love.

¹: I believe Unicode encodes some meaningless Kanji/Hanzi glyphs that were created by accidentally confusing two other, genuine glyphs; I'm pretty sure it only does so because it inherited them from legacy pre-Unicode encodings.

Is it UTF-32LE or UTF-32BE?

The good thing about UTF-8 is that you don't have to choose.
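To make the byte-order point concrete, a quick Python sketch (the sample character is arbitrary):

```python
ch = "A"

utf32_le = ch.encode("utf-32-le")  # b'A\x00\x00\x00'
utf32_be = ch.encode("utf-32-be")  # b'\x00\x00\x00A'
utf8 = ch.encode("utf-8")          # b'A'

# Plain "utf-32" prepends a byte order mark so the reader can tell
# which of the two layouts it got: 4 octets of BOM plus 4 of data.
with_bom = ch.encode("utf-32")
print(utf32_le, utf32_be, utf8, len(with_bom))
```

UTF-8 is defined as a sequence of octets, so there is only one possible layout and nothing to negotiate.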

what would the web look like if we didn't care about backwards compat? ...who cares?

How terribly unconstructive of you.

1. Sarcasm is not wit.

2. Dismissal by analogy is no dismissal at all.

3. The polite individual addresses well intentioned questions in good faith or not at all.

Now, I've asked my question because, in a community of peers of various stripes, it's entirely possible that there are members more versed in character encodings than myself and that there are, indeed, better ways to encode characters which have not been adopted because of the need to maintain backwards compatibility with the ASCII character set. One poster mentioned UTF-32 as a non-backwards-compatible example which, while space inefficient, is a good-faith answer to my question.

As to your clumsy dismissal-by-analogy question, here are some things off the top of my head:

* Substitute byte-code standards for browser javascript.

* Specify parsing, rendering semantics from the start for all markup/presentation languages.

* Automated browser compliance testing from the start for all standards.

* Effective client-side user storage.

* Include stateful communication channels from the word go.

Innovation in a field of endeavor only occurs by examining the assumptions of the field and invalidating them as their context changes. Those that don't care to do so are absolute hacks, stirring the sewer's murk in the hopes that a shiny bauble will occasionally bubble up from the depths.

sorry, should have been more careful with my tone. i only meant that improving UTF-8 is interesting CS research but not terribly useful in real life.

Your premise does not necessarily grant your conclusion without more elaboration. How does research necessarily not have a benefit for 'real life'? Or, to put it another way, why is it that basic research is, by default, divorced from 'real life'?

Presumably if there were a more efficient character encoding with sufficient advantages there would indeed be some 'useful' aspect to it. However, and I'd like to drive this point home, you have no business deriding the well-intentioned questions of others if you have nothing to contribute. Thus far you've outright dismissed even the validity of raising the questions--'who cares'--and, after being confronted, backpedaled somewhat but still dismissed the question's utility--'not terribly useful in real life'--without being so kind as to explain why, as if it were self-evident. Perhaps it is to you, but this is somewhat the point of my last harangue: innovation starts with questions, even those which are seemingly naive.

You have been, to this point, not uncommonly, sadly, but rude indeed. If you have thoughts to share on the subject I would love to hear them. To contribute only blase dismissal tends to make individuals of less than adamantine character cease or hide their well-natured questioning of the world about them. I do not mean to suggest, of course, that every question posed should be met with twee praise: no, indeed. Rather, explorations should be met with good faith, disagreements elaborated such that rational observers might find in the conversation new ideas, or strengthening of their own. Our individual actions steer the culture in which we find ourselves; I hope we can both agree that it is a better world in which basic exploration is the norm, not dogged travel in a well-worn rut when a better road might be found.

I don't want to gloat, but when I dump my database in postgres and then restore the dump, I don't have to concern myself with character sets and --default-encoding parameters because the dump will contain a "set encoding" statement that corresponds to the file's content.

Aside of that, IMHO, the database is the least problematic part of the chain as long as you tell the client library what character set the incoming data will be in. It then should transcode automatically if needed.

One additional thing: I once witnessed MySQL silently truncating Latin-1 data I accidentally tried to store in a UTF-8 table, so you might want to be a bit careful. Usually you should just get an error if you tell the database that your data is UTF-8, but it isn't (http://pilif.github.com/2008/02/failing-silently-is-bad/)

Lastly, IMHO, the biggest issue is, as usual, the browser: to this day it's possible to have IE submit data in ISO-* (depending on the user's locale) despite the page clearly stating that it only accepts UTF-8. Be mindful of this and fix the encoding if you can (or have the database blow up - see above)
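One way "fix the encoding if you can" might look in Python. This is only a sketch: the function name is hypothetical, and the ISO-8859-1 fallback is an assumption about what the browser actually sent, which you cannot verify from the bytes alone.

```python
def decode_form_value(raw: bytes, fallback: str = "iso-8859-1") -> str:
    """Decode submitted octets as UTF-8, falling back to a legacy encoding."""
    try:
        # Strict decoding: invalid UTF-8 raises instead of being mangled.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 came from a browser
        # that ignored the page encoding and used its locale default.
        return raw.decode(fallback)


good = decode_form_value("\u00e9".encode("utf-8"))       # valid UTF-8
rescued = decode_form_value("\u00e9".encode("latin-1"))  # lone 0xE9, not valid UTF-8
```

If you would rather have things blow up, drop the `except` branch and let the `UnicodeDecodeError` propagate before the data ever reaches the database.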

Actually, utf8 in MySQL is a cut-down UTF-8 limited to 3 bytes per character, i.e. it only covers the 16-bit range of the obsolete UCS-2, and you should be using utf8mb4 for real UTF-8. http://dev.mysql.com/doc/refman/5.6/en/charset-unicode.html

Thanks for the warning! I know the MySQL developers have a huge challenge to maintain compatibility with existing data and clients, but it almost seems like they're trying to make Unicode difficult to use correctly when they: default to Latin1 encoding; call it something accurate yet surprising like "latin1_swedish_ci"; silently convert and corrupt data if the client connection specifies a different encoding; add a Unicode encoding called "utf8" but don't support everything that other UTF-8 does; and finally add real UTF-8 support but call it something different like "utf8mb4".

MySQL seems to be the cause of many of the web's character encoding problems; search for keywords like "mysql utf8 hell" and you will get many hits.

Are the MySQL developers doing anything to improve the situation? Here are some possible improvements off the top of my head. I won't call them solutions. <:)

  1. Use UTF-8 internally, converting based on the client's encoding.
  2. Use UTF-8 internally and force clients to do their own conversions.
  3. Tag all string data with its encoding type for run-time checks.
  4. At the very least, default to UTF-8 rather than latin1_swedish_ci!

There are many utf8 collations, so you should pick a specific one to default to. I recently read that utf8_unicode_ci [0] is the best.

[0] http://philsturgeon.co.uk/blog/2009/08/UTF-8-support-for-Cod...

It is probably worth including a character outside the basic multilingual plane (e.g. anything above 0x10000, like http://unicodelookup.com/#0x22222/1) when testing UTF-8 web support. I recently was working on a Japanese teaching web application that needed such characters and sadly learned that MySQL versions before 5.5 do not support UTF-8 characters outside of BMP (anything that needs more than 4 UTF-8 octets) and text to image drawing library support was also sketchy.

4 octets of UTF-8 suffice to cover all Unicode characters. Unicode is essentially 21-bit (U+0 to U+10FFFF), not 32-bit. The BMP is 16 bits, U+0 to U+FFFF. 3 octets suffice for it.
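These numbers are easy to check in Python. The sample characters below are my own picks, one per UTF-8 length:

```python
def utf8_len(ch: str) -> int:
    """Number of octets the character takes in UTF-8."""
    return len(ch.encode("utf-8"))


samples = {
    "U+0041": "A",           # ASCII
    "U+00E9": "\u00e9",      # Latin-1 range
    "U+20AC": "\u20ac",      # elsewhere in the BMP
    "U+22222": "\U00022222", # outside the BMP
}
lengths = {name: utf8_len(ch) for name, ch in samples.items()}

# The highest code point needs 21 bits and still fits in 4 octets.
top_bits = (0x10FFFF).bit_length()
top_len = utf8_len("\U0010FFFF")
print(lengths, top_bits, top_len)
```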

It's useful to know that MySQL support outside the BMP doesn't work, but I would guess it's a generic problem affecting all Unicode support, not restricted to UTF-8.

(Yes, UTF-8 was defined to go up to 6 octets and cover 31 bits. As used with Unicode, only up to 4 are supposed to be used...)

Yes, you are right on the 3 vs 4 octets for outside the BMP; it is the 4-octet UTF-8 that MySQL pre-5.5 doesn't work with. With MySQL 5.5 the full basic LAMP stack, at least, will now handle non-BMP characters.

(My guess is wrong. They really did have a hardcoded 3 octet pseudo-characterset. Ugh.)

Aside from the curly quotes, his "unicode canary" still only uses Latin-1 characters. Dumb. At least throw some latin-2 and/or Cyrillic/Greek in there.

Even better: Include characters outside the Basic Multilingual Plane. That catches problems with UTF-16 that might exist along the way.
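A quick Python sketch of why a non-BMP character catches UTF-16 problems: it takes two 16-bit code units (a surrogate pair) instead of one, so anything that counts or slices by code unit gets it wrong. The sample characters are arbitrary.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units the string takes in UTF-16."""
    return len(s.encode("utf-16-le")) // 2


bmp_char = "\u20ac"         # EURO SIGN, inside the BMP
astral_char = "\U00022222"  # a CJK ideograph outside the BMP

units_bmp = utf16_units(bmp_char)        # 1
units_astral = utf16_units(astral_char)  # 2, a surrogate pair
pair = astral_char.encode("utf-16-be").hex()
print(units_bmp, units_astral, pair)
```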

Even trickier: add combining characters.

You can have more code points than characters, and a letters-only string that contains code points that are not letters.
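Both effects show up with a single accented letter in Python. A sketch using the standard `unicodedata` module, with "e" plus a combining acute accent as the sample:

```python
import unicodedata

decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
composed = unicodedata.normalize("NFC", decomposed)  # single code point U+00E9

# One visible character, two code points:
print(len(decomposed), len(composed))

# The combining mark has category Mn (nonspacing mark), not a letter
# category, so a per-code-point "is it all letters?" check fails.
print(unicodedata.category("\u0301"), decomposed.isalpha(), composed.isalpha())
```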

Unicode is truly evil at the edges.

Here's a modified version with Cyrillic, Greek and Polish (ISO-8859-2) letters, along with a few more for good measure.


A truly masochistic canary would include an old ASCII control character (0x00-0x1F), some of which I don't believe encode properly into XML as either raw bytes or literal entities, and also a character off the Basic Multilingual Plane, like, say, Linear B.
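For reference, XML 1.0 only allows tab, LF and CR from the control range. A Python sketch of that rule, based on the `Char` production in the XML 1.0 spec (the helper names are hypothetical):

```python
def is_xml10_char(ch: str) -> bool:
    """True if the code point is allowed in an XML 1.0 document."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def illegal_xml_chars(text: str) -> list:
    """Return the characters in text that XML 1.0 cannot carry."""
    return [ch for ch in text if not is_xml10_char(ch)]


# Canary with a control character and a Linear B syllable (U+10000).
canary = "ok\x01 \U00010000"
print([hex(ord(ch)) for ch in illegal_xml_chars(canary)])
```

The Linear B character is legal XML; it is there to stress UTF-16 handling, while the control character stresses the XML layer.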

I do use UTF-8 for my entire web stacks, but I encode characters in their special characters:




When outputted to the browser.

Can anyone tell me if that is correct and sane?

Translating characters like & and € to &amp; and &euro; saves me a lot of hassle with validation: http://validator.w3.org/check?verbose=1&uri=http%3A%2F%2...

  Line 265, Column 190:
  non SGML character number 157
  <code>“Iñtërnâtiônà lizætiøn”</code>

  You have used an illegal character in your text. HTML 
  uses the standard UNICODE Consortium character 
  repertoire, and it leaves undefined (among others) 65 
  character codes (0 to 31 inclusive and 127 to 159 
  inclusive) that are sometimes used for typographical 
  quote marks and similar in proprietary character sets. 
  The validator has found one of these undefined characters 
  in your document. The character may appear on your 
  browser as a curly quote, or a trademark symbol, or some 
  other fancy glyph; on a different computer, however, it 
  will likely appear as a completely different character, 
  or nothing at all.

  Your best bet is to replace the character with the 
  nearest equivalent ASCII character, or to use an 
  appropriate character entity. For more information on 
  Character Encoding on the web, see Alan Flavell's 
  excellent HTML Character Set Issues reference.

  This error can also be triggered by formatting characters 
  embedded in documents by some word processors. If you use 
  a word processor to edit your HTML documents, be sure to 
  use the "Save as ASCII" or similar command to save the 
  document without formatting information.

  Line 344, Column 79: 
  cannot generate system identifier for general entity "src"
  An entity reference was found in the document, but there 
  is no reference by that name defined. Often this is 
  caused by misspelling the reference name, unencoded 
  ampersands, or by leaving off the trailing semicolon (;). 
  The most common cause of this error is unencoded 
  ampersands in URLs as described by the WDG in "Ampersands 
  in URLs".

Sure that works for some things but

1. You know there are no entity names for most Unicode characters? E.g. Chinese. You may as well use the numeric entity codes.

2. In XML, the entity names other than lt, gt, amp, quot and apos are not defined unless you have a DTD, so you should not use them across an XML API, e.g. for an Atom feed, or across an XML web service, unless it defines a DTD including them, which is unlikely.

3. If you get those errors, it is because you have something set up wrong. Those things are fixable. Fixing them will help you understand what's going on better. As the article says, get out your hex...

Thank you very much!

If I understand correctly, you are saying to drop the entity names and start using the numeric entity codes. This shouldn't be much of a problem.

I really did bump into problems with an RSS feed and entity names, so that's another great point. I solved that by wrapping it in <![CDATA[ ]]> and using the numeric entity codes (&euml; becomes &#235;), so now I am wondering why I am even mixing entity names and numeric entity codes in the first place.
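If you do go the numeric route, Python's standard library can produce the codes for you. A sketch, with a hypothetical function name: it escapes the markup characters first, then turns every non-ASCII character into a decimal character reference.

```python
import html


def to_numeric_entities(text: str) -> str:
    """Escape &, <, > and quotes, then replace non-ASCII with &#NNN; codes."""
    escaped = html.escape(text, quote=True)
    # "xmlcharrefreplace" swaps each unencodable character for &#<decimal>;
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_numeric_entities("\u00eb"))                 # &#235;
print(to_numeric_entities("\u20ac & \u00eb"))        # &#8364; &amp; &#235;
```

This works for characters that have no entity name at all, since it only depends on the code point.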

Break your habit of converting them to HTML entities. Put them in your source as actual UTF-8 characters.

If your page is really being delivered as UTF-8 then it will pass all validation, just using the real characters. (You still have to escape & as &amp; of course.)

Here, I put together a little example for you: http://50pop.com/i18n.html

View source to verify. Click the validate link.

Hope that helps.

The error does not come from “Iñtërnâtiônàlizætiøn”, but from “Iñtërnâtiônà lizætiøn”, which contains the U+009D character (correctly encoded and delivered as UTF-8). Apparently that character is not allowed in HTML documents. So & is not the only character you need to escape, and seeing that even you didn't know this detail, I don't think it's a bad habit to play it safe and just escape all characters other than printable ASCII.

edit: I read the HTML5 spec, it says: "Text must not contain control characters other than space characters". So a reasonable solution would be to pass all printable characters as UTF8 and encode control characters. But as I said, I'd prefer to err on the side of caution, in this case encode more than necessary if I'm not sure exactly which characters need encoding and which do not.

No, your problem is that the UTF-8 encoding of U+009D isn't 9d, it's c2 9d. So if you're encoding it as 9d, you're not writing out UTF-8, you're writing out latin-1, which of course leads to displaying random characters. Serve your page as utf-8 and encode it properly.
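Getting out the hex, in Python. This sketch just checks the two encodings of U+009D and what kind of character it is:

```python
import unicodedata

ch = "\u009d"

utf8_hex = ch.encode("utf-8").hex()      # 'c29d'
latin1_hex = ch.encode("latin-1").hex()  # '9d'

# U+009D is a C1 control character (general category Cc).
category = unicodedata.category(ch)

# A lone 0x9d octet is a UTF-8 continuation byte with nothing to continue,
# so a strict UTF-8 decoder rejects it.
try:
    b"\x9d".decode("utf-8")
    lone_byte_is_valid = True
except UnicodeDecodeError:
    lone_byte_is_valid = False

print(utf8_hex, latin1_hex, category, lone_byte_is_valid)
```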

Who cares what validators say. You should use UTF. We're long past the point where web browsers don't support it.

I care what validators say or imply. The reason I started using UTF-8 in the first place was that it is a W3C accessibility guideline to use UTF-8 over the (at that time) more common ISO-8859-1.

If you want to create accessible websites, one of the first requirements is validated code.

If we ignore validators and the W3C, who is there to officially tell us what we _should_ do?

We are long past the point, sure, so much so, that W3C recommends it too.

And about browser support: If you want to guarantee that most browsers understand and support your code, your best bet is to adhere to the W3C that wrote the standard.

The W3C validator works fine. Please see this example: http://50pop.com/i18n.html

Maybe your webserver is not serving your page as UTF-8?

Get a better validator: http://validator.nu/

Rather than add --default-character-set every time you invoke the mysql client, add the setting to your my.cnf/my.ini config file.

Just set the client and server defaults in my.cnf. MySQL has four places where encoding can be set: client connection, server, database, column. Setting the client and server in my.cnf covers everything.
