I say "binary" and "text" because the Internet cannot transmit text; it can only transmit "binary" octet streams. (Similarly, UNIX files can only store octets, and UNIX file names can only store octets other than / and NUL.) But your programming language supports both text manipulation and binary manipulation, so you have to tell it how you want to treat the data. Each language is different; Perl treats everything as Latin-1 text by default (which happens to work nicely for binary as well, but not so nicely for UTF-8-encoded text).
Often, libraries will handle this for you, since they have access to out-of-band information. If your locale is en_US.UTF-8, filenames can be assumed to be UTF-8-encoded. If the HTTP response's content-type says "charset=utf-8", your HTTP library will know to decode the octet stream into text for you. But it's important that you both test this and find the code that does it for you, because sometimes library authors forget or libraries have bugs, and one bug will ruin your whole operation.
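That out-of-band decoding step can be sketched like this in Python (the header value is a hypothetical example, and the parse is deliberately simplified; a real HTTP library handles quoting and case for you):

```python
# A Content-Type header carries the charset out-of-band (hypothetical value)
header = "text/html; charset=utf-8"
body = b"na\xc3\xafve text off the wire"  # raw octets, not text

# Simplified parse; real HTTP libraries do this properly
charset = header.split("charset=")[1] if "charset=" in header else "latin-1"
text = body.decode(charset)  # octets -> text, using the declared encoding
print(text)
```

This is exactly the code you should go find (and test) inside your HTTP library.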
Handling Unicode text is hard because it's a rare case where you have to get everything right or the results of your program will be undefined. And there are no "reasonable defaults", so you have to be explicit about everything. Finally, you can't guess what encoding your data is in; all binary data must come with an encoding out-of-band, or your program will break horribly. Proper text manipulation is the ultimate test of "can I write correct software", and it isn't easy.
HTML is a good example. Browsers are very tolerant of malformed HTML, which is nice for beginners who don't want to worry too much about perfect syntax.
The problem is each browser handles the unspecified cases differently, which leads to differences in the way pages are rendered, security issues like XSS, etc.
Robustness should just be built into the protocol/format/spec, if necessary. HTML5 gets this right by specifying an algorithm that all parsers should use to get consistent behavior, while still being tolerant of imperfect syntax: http://en.wikipedia.org/wiki/Tag_soup#HTML5
There's no way to avoid it unless you wrap it up and add some explicit checks and guesses.
They should. If so there's no need to guess.
Faced with something less rigorous (like the dreaded ID3 tags), no such luck.
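A sketch of the "wrap it up and add some explicit checks and guesses" approach, in Python. Note that Latin-1 must come last in the list, because every octet sequence decodes successfully under it:

```python
def guess_decode(octets: bytes) -> str:
    """Try strict UTF-8 first, then fall back to Latin-1 (which never fails)."""
    for encoding in ("utf-8", "latin-1"):
        try:
            return octets.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("no fallback decoded the data")  # unreachable with latin-1 last

print(guess_decode(b"caf\xc3\xa9"))  # valid UTF-8
print(guess_decode(b"caf\xe9"))      # not valid UTF-8; Latin-1 fallback
```

Both calls print "café", but only because these particular octets happen to be Latin-1 when they aren't UTF-8; a guess is still a guess.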
¹: I believe Unicode encodes some meaningless Kanji/Hanzi glyphs that were created by accidentally confusing two other, genuine glyphs; I'm pretty sure it only does so because it inherited them from legacy pre-Unicode encodings.
The good thing about UTF-8 is that you don't have to choose.
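That backwards compatibility is easy to demonstrate: every ASCII string is already valid UTF-8, byte for byte.

```python
s = "plain ASCII"
print(s.encode("utf-8") == s.encode("ascii"))  # the octets are identical
```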
1. Sarcasm is not wit.
2. Dismissal by analogy is no dismissal at all.
3. The polite individual addresses well intentioned questions in good faith or not at all.
Now, I've asked my question because, in a community of peers of various stripes, it's entirely possible that there will be members more versed in character encodings than myself, and that there are, indeed, better ways to encode characters which go unused because of the need to maintain backwards compatibility with the ASCII character set. One poster mentioned UTF-32 as a non-backwards-compatible example which, while space inefficient, is a good-faith answer to my question.
As to your clumsy dismissal-by-analogy question, here are some things off the top of my head:
* Specify parsing, rendering semantics from the start for all markup/presentation languages.
* Automated browser compliance testing from the start for all standards.
* Effective client-side user storage.
* Include stateful communication channels from the word go.
Innovation in a field of endeavor only occurs by examining the assumptions of the field and invalidating them as the context of their being changes. Those that don't care to do so are absolute hacks, stirring the sewer's murk in the hopes that a shiny bauble will occasionally bubble up from the depths.
Presumably if there were a more efficient character encoding with sufficient advantages there would indeed be some 'useful' aspect to it. However, and I'd like to drive this point home, you have no business deriding the well-intentioned questions of others if you have nothing to contribute. Thus far you've outright dismissed even the validity of raising the questions--'who cares'--and, after being confronted, backpedaled somewhat but still dismissed the question's utility--'not terribly useful in real life'--without being so kind as to explain why, as if it were self-evident. Perhaps to you, but this is somewhat the point of my last harangue: innovation starts with questions, even those which are seemingly naive.
You have been, to this point, rude indeed, which is sadly not uncommon. If you have thoughts to share on the subject I would love to hear them. To contribute only blasé dismissal tends to make individuals of less than adamantine character cease or hide their well-natured questioning of the world about them. I do not mean to suggest, of course, that every question posed should be met with twee praise: no, indeed. Rather, explorations should be met with good faith, and disagreements elaborated such that rational observers might find in the conversation new ideas, or a strengthening of their own. Our individual actions steer the culture in which we find ourselves; I hope we can both agree that it is a better world in which basic exploration is the norm, not dogged travel in a well-worn rut when indeed a better road might be found.
Aside from that, IMHO, the database is the least problematic part of the chain, as long as you tell the client library what character set the incoming data will be in. It should then transcode automatically if needed.
One additional thing: I once witnessed MySQL silently truncating Latin-1 data I accidentally tried to store in a UTF-8 table, so you might want to be a bit careful. Usually you should just get an error if you tell the database that your data is UTF-8 but it isn't (http://pilif.github.com/2008/02/failing-silently-is-bad/)
Lastly, IMHO, the biggest issue is, as usual, the browser: to this day it's possible to have IE submit data in ISO-* encodings (depending on the user's locale) despite the page clearly stating that it accepts only UTF-8. Be mindful of this and fix the encoding if you can (or have the database blow up - see above)
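A sketch of the session settings that address both problems above (assuming MySQL, and utf8mb4 rather than the legacy 3-byte utf8):

```sql
-- Declare the client's encoding so the server can transcode correctly
SET NAMES utf8mb4;
-- Turn silent truncation/mangling into hard errors
SET sql_mode = 'STRICT_ALL_TABLES';
```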
Are the MySQL developers doing anything to improve the situation? Here are some possible improvements off the top of my head. I won't call them solutions. <:)
1. Use UTF-8 internally, converting based on the client's encoding.
2. Use UTF-8 internally and force clients to do their own conversions.
3. Tag all string data with its encoding type for run-time checks.
4. At the very least, default to UTF-8 rather than latin1_swedish_ci!
It's useful to know that MySQL's support for characters outside the BMP doesn't work, but I would guess it's a generic problem affecting all of its Unicode support, not something specific to UTF-8.
(Yes, UTF-8 was defined to go up to 6 octets and cover 31 bits. As used with Unicode, only up to 4 are supposed to be used...)
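The BMP boundary shows up directly in the encoded length; a quick Python check (the astral code point here is an arbitrary example):

```python
bmp = "\u20ac"         # U+20AC EURO SIGN, inside the BMP
astral = "\U0001f600"  # U+1F600, outside the BMP

print(len(bmp.encode("utf-8")))     # 3 octets
print(len(astral.encode("utf-8")))  # 4 octets -- what trips up 3-byte-max implementations
```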
You can have more code points than characters, and a letters-only string can contain code points that are not letters.
Unicode is truly evil at the edges.
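A concrete Python illustration of both edge cases: a single user-perceived character made of two code points, one of which is a combining mark rather than a letter.

```python
import unicodedata

s = "e\u0301"  # 'e' + U+0301 COMBINING ACUTE ACCENT; renders as one character
print(len(s))                          # 2 code points
print(unicodedata.category("\u0301"))  # 'Mn' -- a mark, not a letter
print(unicodedata.normalize("NFC", s) == "\u00e9")  # composes to precombined U+00E9
```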
When output to the browser.
Can anyone tell me if that is correct and sane?
Translating characters like & and € to &amp; and &euro; saves me a lot of hassle with validation:
Line 265, Column 190:
non SGML character number 157
You have used an illegal character in your text. HTML
uses the standard UNICODE Consortium character
repertoire, and it leaves undefined (among others) 65
character codes (0 to 31 inclusive and 127 to 159
inclusive) that are sometimes used for typographical
quote marks and similar in proprietary character sets.
The validator has found one of these undefined characters
in your document. The character may appear on your
browser as a curly quote, or a trademark symbol, or some
other fancy glyph; on a different computer, however, it
will likely appear as a completely different character,
or nothing at all.
Your best bet is to replace the character with the
nearest equivalent ASCII character, or to use an
appropriate character entity. For more information on
Character Encoding on the web, see Alan Flavell's
excellent HTML Character Set Issues reference.
This error can also be triggered by formatting characters
embedded in documents by some word processors. If you use
a word processor to edit your HTML documents, be sure to
use the "Save as ASCII" or similar command to save the
document without formatting information.
Line 344, Column 79:
cannot generate system identifier for general entity "src"
An entity reference was found in the document, but there
is no reference by that name defined. Often this is
caused by misspelling the reference name, unencoded
ampersands, or by leaving off the trailing semicolon (;).
The most common cause of this error is unencoded
ampersands in URLs as described by the WDG in "Ampersands
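Bytes in the 128-159 range that the first error describes usually mean Windows-1252 punctuation leaked into a page labeled as ISO-8859-1 or UTF-8. A sketch of the usual repair, assuming the stray bytes really are cp1252 (note that 157 itself is one of the few positions unassigned even in cp1252):

```python
# Smart quotes as Windows-1252 octets, mislabeled as Latin-1/UTF-8
octets = b"\x93quoted\x94"
text = octets.decode("cp1252")  # yields U+201C and U+201D curly quotes
print(text)
utf8 = text.encode("utf-8")     # now safe to serve on a UTF-8 page
```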
1. You know there are no entity names for most Unicode characters? E.g. Chinese. You may as well use the numeric entity codes.
2. In XML, the entity names other than lt, gt, amp, apos and quot are not defined unless you have a DTD, so you should not use them across an XML API, e.g. for an Atom feed, or across an XML web service, unless it defines a DTD including them, which is unlikely.
3. If you get those errors, it is because you have something set up wrong. Those things are fixable, and fixing them will help you understand what's going on better. As the article says, get out your hex...
If I understand correctly, you are saying to drop the entity names and start using the numeric entity codes. This shouldn't be much of a problem.
I really did bump into problems with an RSS feed and entity names, so that's another great point. I solved it by wrapping the content in <![CDATA[ ]]> and using the numeric entity codes (&euml; becomes &#235;), so now I am wondering why I am even mixing entity names and numeric entity codes in the first place.
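Python can generate those numeric references for you via an error handler, which sidesteps entity names entirely:

```python
# Non-ASCII characters become numeric character references automatically
s = "\u00eb and \u20ac"  # an e-diaeresis and a euro sign
print(s.encode("ascii", "xmlcharrefreplace").decode("ascii"))
```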
If your page is really being delivered as UTF-8 then it will pass validation just using the real characters. (You still have to escape & as &amp;, of course.)
Here, I put together a little example for you: http://50pop.com/i18n.html
View source to verify. Click the validate link.
Hope that helps.
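With a genuinely UTF-8 page, the only escaping left is for the markup-significant characters, and Python's stdlib covers that:

```python
from html import escape

s = "fish & chips for 5\u20ac"
print(escape(s))  # only &, <, > (and quotes) get escaped; the euro sign stays literal
```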
edit: I read the HTML5 spec; it says: "Text must not contain control characters other than space characters". So a reasonable solution would be to pass all printable characters through as UTF-8 and encode the control characters. But as I said, I'd prefer to err on the side of caution: in this case, encode more than necessary when I'm not sure exactly which characters need encoding and which do not.
If you want to create accessible websites, one of the first requirements is validated code.
If we ignore validators and the W3C, who is there to officially tell us what we _should_ do?
We are long past that point, sure; so much so that the W3C recommends it too.
And about browser support: if you want to guarantee that most browsers understand and support your code, your best bet is to adhere to the standards the W3C wrote.
Maybe your webserver is not serving your page as UTF-8?