
Unicode nearing 50% of the web - wglb
http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html
======
viraptor
As a person with a Unicode (or ISO-something-no-one-ever-uses) first name - I'd
like to thank all of you who enabled Unicode on your databases and pages. It
makes my life much easier and more pleasant when I can use my real name during
registration. Even if I keep getting my snail mail with '?'s, '+'s, or my name
URL-encoded...

~~~
brlewis
You're welcome.

------
quant18
I'd really rather go back to the old way when there were lots of competing
national encodings for each language, and the actual users of that language
could vote with their web pages/documents for which one they preferred.
Instead we have this one overarching encoding whose subparts were fixed for
all time by fiat from committees _before_ being put into practical use, and as
you might expect, some of those committees really screwed things up.

For example, some _fine upstanding gentlemen_ decided that in Unicode (and
GB-18030), Mongolian ᠣ and ᠤ, which are printed/handwritten exactly the same,
shall be two "different letters" U+1823 and U+1824, but the different forms of
ᠳ are the "same letter" U+1833. (And of course, there's ᡩ U+1869 which _looks_
like what you want for some forms of U+1833, but you're not supposed to use it
because it's "only for Xibe").
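
For anyone who wants to check the official assignments themselves, Python's
unicodedata module prints the character names; this is just the handful of
codepoints mentioned above:

    import unicodedata
    
    # Print the official Unicode names of the codepoints discussed
    # above; the visual identities/differences don't show up here.
    for cp in (0x1823, 0x1824, 0x1833, 0x1869):
        print("U+%04X %s" % (cp, unicodedata.name(chr(cp))))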

The closest analogy I can give in English is an encoding which forced you to
use different "a" codepoints for the characters in apple vs. fake because of
their different pronunciations, while making a single codepoint for "k" and
"ck" and "c" (but only sometimes) because they sound the same. If you ever saw
an encoding like that, you'd no doubt say to yourself: "WTF? I'm not using
this, I'll stick with ASCII/EBCDIC/Morse code, thank you very much".

~~~
viraptor
Please, no... That left a simple language like Polish with latin2, cp852 (or
something like that), Mazovia, Mazovia2, and probably some more homebrew
encodings. I can't even imagine what would happen with completely different
scripts like the Indic ones.

There are problems with Unicode - OK, so let's resolve them. I still want to be
able to address my email to the real name of a person with a name in language
A, living at an address in country B, and to sign the email properly in
language C (where all parts use language-specific characters). Unicode is the
first standard that lets me do that in most cases, so I'd say it's a step in
the right direction.

~~~
quant18
_I can't even imagine what would happen with completely different scripts like
the Indic ones._

As far as I know, Ge'ez (for Ethiopian languages) sets the record with 70+
encodings [1] which took all sorts of different approaches.

The Unicode design process worked out quite happily for Latin and Cyrillic
alphabet users because

1\. There was widespread agreement about what the smallest indivisible unit of
the script is (thanks to a long history of literacy education, decades of
typewriter usage, etc.). No one suggested _brilliant_ schemes like encoding
"O" as "C" plus a right-concave combining mark ")", or "I" as "T" plus an
underline, for example.

2\. Among the hundreds of millions of users of those scripts, there were
enough countries with a reasonable history not just of typewriter usage but
also of computer usage, and enough time for them to develop various competing
encodings whose mistakes Unicode could learn from.

The Mongolian script as used in Inner Mongolia pretty much presented the
worst-case scenario on both of the above criteria:

1\. The actual users of the script were a small and poor population with high
illiteracy rates and not many computer users; and unlike e.g. Cambodians or
Ethiopians, they had no big diaspora population of refugees living in the US
or other high-tech countries either (hence no one fluent in English to
advocate for them and point out problems in the proposed encodings).

2\. As a result of #1, a disproportionate amount of the discussion surrounding
the encoding was generated by scholars whose main aim was digitising quirky
classical texts, not by everyday people who wanted to write everyday things
without the computer making them think of extraneous details they never think
about when writing by hand.

3\. These scholars can't even agree on what the basic unit of the script is
(in Russian grad schools, they teach it as an alphabet; in Japanese grad
schools, as a syllabary).

[1] <http://www.punchdown.org/rvb/papers/EriPaper3C.html>

------
mooism2
By "Unicode" they mean "UTF-8".

~~~
pmjordan
None of the other explicitly listed encodings are Unicode encodings, and the
"other" category is tiny, so the statement is still effectively true. Some
browsers don't even support other Unicode encodings, so this doesn't surprise
me. UTF-16 is the only one that even stands a chance; I've never seen UTF-32
used for files, and I've never seen UTF-7 used at all. I suspect UTF-16 is
more efficient than UTF-8 for East Asian scripts, but that advantage probably
dwindles once the content is gzipped.

It's good news that Google are now decomposing ligature codepoints, although I
do wish they had a version of their search that was literal; especially with
programming-related and other technical searches, the special characters it
filters out are often crucial.
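
For reference, the standard mechanism for this kind of folding (whatever
Google actually uses internally) is Unicode compatibility normalization; a
quick Python illustration:

    import unicodedata
    
    # NFKC maps compatibility characters such as ligatures onto
    # their component letters.
    print(unicodedata.normalize("NFKC", "ﬁle"))  # U+FB01 'ﬁ' -> "file"
    print(unicodedata.normalize("NFKC", "ﬃ"))   # U+FB03 'ﬃ' -> "ffi"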

~~~
robryan
Is there no way around the symbols being filtered out? I run into this all the
time too; I always assumed there would be a way around it but never bothered
to look into it.

~~~
pmjordan
If there is a way, I haven't found it. Enclosing the search terms in quotes
sometimes seems to help a little.

------
thirdstation
I wonder how many of those sites are pushing Unicode without knowing it. I
still see programmers and non-programmers scratching their heads over
character encoding issues.

~~~
zmimon
I reckon there's a whole bunch pushing non-Unicode without knowing it, but
declaring UTF-8 anyway. Since current versions of PHP don't even support
Unicode (at least, not without going to very special pains), I suspect there
are an awful lot of web sites just shoving out content in non-Unicode
encodings, calling it UTF-8, and wondering why every now and then they see a
funny question mark in someone's name, etc.
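
The failure mode is easy to reproduce; a minimal sketch in Python (the name is
just an example):

    # Latin-1 bytes served under a UTF-8 label: 0xE9 is 'é' in
    # Latin-1, but in UTF-8 it begins a 3-byte sequence, so a lone
    # 0xE9 is invalid and lenient decoders substitute U+FFFD.
    raw = "José".encode("latin-1")         # b'Jos\xe9'
    print(raw.decode("utf-8", "replace"))  # prints "Jos" + U+FFFD
    try:
        raw.decode("utf-8")                # strict decoding raises
    except UnicodeDecodeError as err:
        print(err)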

------
happenstance
Why does MySQL still default to latin1? (Or am I mistaken?)

~~~
wvenable
Because changing the defaults might mean a world of hurt for people not
expecting it. You should just always be explicit.

~~~
happenstance
> You should just always be explicit.

I generally agree.

Note: I think if you leave any specific encoding configuration out of your
my.cnf, MySQL implicitly goes with latin1.

For users assuming a modern Unicode/UTF-8 default setup, it might be nice if
the MySQL folks required an explicit config for this setting.
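
For example, with the MySQLdb driver (connection details here are
hypothetical; the point is passing the charset explicitly):

    import MySQLdb
    
    # Hypothetical connection details; an explicit charset overrides
    # whatever latin1 default the server is carrying.
    conn = MySQLdb.connect(host="localhost", user="app",
                           passwd="secret", db="appdb",
                           charset="utf8", use_unicode=True)
    cur = conn.cursor()
    cur.execute("SHOW VARIABLES LIKE 'character_set_%'")
    for name, value in cur.fetchall():
        print(name, value)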

------
mjgoins
Don't they mean ½?

------
gchpaco
It's interesting that UTF-8 has largely achieved its success amongst English
speakers and Latin-1 languages; almost everything else has remained more or
less static or is on a slow downward trend. Since we're not seeing an increase
in UTF-16 or UCS-2 or anything else like that, this would seem to be evidence
that the Web, or at least Google's view of it, is becoming even more dominated
by Western European languages, which is itself an interesting idea.

~~~
pmjordan
I don't see how you're inferring language share within UTF-8. Being a Unicode
encoding, it can be used to represent text in all popular scripts; that's kind
of the point.

~~~
gchpaco
But UTF-16 is far more advantageous an encoding if you speak Chinese, Arabic,
Russian, or Japanese; UTF-8 can take up to four bytes per character for some
of those.

~~~
jmillikin
The interesting character ranges are:

      U+0000 to U+007F: 1 byte  for UTF-8, 2 for UTF-16
      U+0080 to U+07FF: 2 bytes for UTF-8 and UTF-16
      U+0800 to U+FFFF: 3 bytes for UTF-8, 2 for UTF-16
      U+10000 and up:   4 bytes for UTF-8 and UTF-16

Arabic, Cyrillic (Russian), Hebrew, and many other non-European scripts take
an equal amount of space in either UTF-8 or UTF-16. Indic scripts, Thai,
Japanese kana, Korean hangul, and the common Chinese characters all sit in the
U+0800 to U+FFFF range, where UTF-16 is more efficient, but only by a byte per
character; the rarer characters above U+FFFF take 4 bytes either way.
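
Easy to verify with a couple of lines of Python (the sample words are
arbitrary):

    samples = [("English", "Hello"), ("Russian", "Привет"),
               ("Arabic", "مرحبا"), ("Chinese", "你好")]
    for lang, text in samples:
        u8 = len(text.encode("utf-8"))
        u16 = len(text.encode("utf-16-le"))  # -le avoids counting a BOM
        print("%-8s UTF-8: %2d bytes, UTF-16: %2d bytes" % (lang, u8, u16))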

~~~
pmjordan
Additionally, since we're talking about web pages, the markup tags will still
be composed of ASCII characters, for which UTF-8 has an advantage.
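
A rough sketch of that effect (the page fragment is made up):

    # ASCII markup is 1 byte per character in UTF-8 but 2 in UTF-16,
    # which usually outweighs UTF-16's one-byte saving on CJK text.
    page = "<html><body><p>你好，世界</p></body></html>"
    print(len(page.encode("utf-8")))      # UTF-8 comes out smaller here
    print(len(page.encode("utf-16-le")))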

