
Unicode over 60 percent of the web  - robin_reala
http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html
======
sqrt17
Occasionally, I have an irresistible urge to strangle everyone who uses
Unicode and UTF-8 (or other UTF encodings) interchangeably.

UNICODE is a good thing because it provides a codepoint for every character
that we care about, instead of having a 256-character subset for every group
of languages and needing complicated software to puzzle out how to convert
from one subset to the other. Unicode allows fantastic stuff such as
upper/lowercasing text including all the weird letters that you previously had
to special-case.

ASCII used to be a good thing because it allowed people to ship around basic
English and Cobol code without any worries, but is actually pretty evil
because people from Anglo-Saxon countries assume that every other bit of text
is composed of English and Cobol.

Having a notion of ENCODINGS is useful if you occasionally get bits of text
that are neither English nor Cobol. You still needed different encodings for
different groups of languages, and arcane mechanisms to provide hints on which
encoding is meant, at least if you got non-English bits of text. The very
notion of an Encoding scares the people who used to think the world consists
of English and Cobol.

UTF-8 is a very reasonable encoding that can be used to represent all of
Unicode while being Ascii-compatible. Hence it is a sane choice as a default
encoding for people who are scared of having to think about encodings. Because
UTF-8 is not the only encoding out there, Unicode-compatible programs accept
Unicode text in many other encodings, including those that cannot represent
the full range of Unicode and are only a good choice for some people but not
others.

tl;dr: non-UTF-8 text can (and should) still be read as Unicode codepoints.
Ignoring the >=40% of texts out there or saying that they're "not Unicode"
doesn't help anybody.
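
To make the distinction concrete, here is a minimal Python sketch (the sample
string is my own illustration, not anything from the article): the same Unicode
codepoints survive a round trip through several encodings, including a legacy
one that can only represent a small subset of Unicode.

    # One piece of Unicode text, serialized three different ways.
    text = "café"                          # U+0063 U+0061 U+0066 U+00E9

    as_utf8   = text.encode("utf-8")       # b'caf\xc3\xa9'
    as_latin1 = text.encode("iso-8859-1")  # b'caf\xe9'  (legacy subset encoding)
    as_utf16  = text.encode("utf-16-le")   # b'c\x00a\x00f\x00\xe9\x00'

    # Decoding any of the three gives back the identical codepoints.
    assert as_utf8.decode("utf-8") == as_latin1.decode("iso-8859-1") \
        == as_utf16.decode("utf-16-le") == text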

~~~
sambeau
This post makes no sense to me.

Are you saying that we should still use character encodings rather than UTF
encodings, or are you saying that we shouldn't assume that raw text is ASCII,
or are you saying something else?

Unicode is, essentially, nothing unless it is encoded. When you encode it you
must decide whether to use one byte, two bytes, three bytes or four bytes.

Only UTF-8 and UTF-32 are really big enough to hold the world's characters.
Everything else is a fudge.

ASCII was never Anglo-Saxon: it was always American. COBOL is a red herring
here, too. ASCII was all about teleprinters and was a clever use of 7 bits for
its time. Bell was heavily involved in ASCII, and Bell Labs invented UTF-8.

UTF-8 is much better than reasonable. It is a compact way to represent Unicode
while sparing the Western world from having to re-encode every text document.
That's a lot of good news for the internet.

Are you saying that we shouldn't ignore other charsets because they are still
valid Unicode? If so I agree up to a point, the point being that there is no
longer any need for any Unicode encoding other than UTF-8. If you need to
access your local characters as a byte array: choose your internal encoding,
translate, do your magic, and then spit out UTF-8 again. Then we can all simply
read the same documents without the need for over-complexity.

~~~
finnw
> _Only UTF-8 and UTF-32 are really big enough to hold the world's characters_

So is GB18030.

_Edit_: and UTF-16, of course (just don't confuse it with UCS-2)

~~~
sambeau
I'm guilty of using UTF-16 when I really mean 16-bit Unicode arrays. I see that
now.

But it does raise the question: why would you use UTF-16? Yes, if you have more
than 512 characters in your common script then OK, I can see it might make
sense, but not much. UTF-8 will still average out in a reasonable way.

~~~
jpablo
Because it's the standard in the Windows API.

------
rwmj
I wonder if we can now start to abandon non-UTF-8 charsets? Ignoring Windows
which is wilfully incompatible and broken, UTF-8 is used pretty much
everywhere that matters, and I would argue that those who don't use it should
use it. If there is something UTF-8 or Unicode can't do, let's fix that.

~~~
sp332
I wouldn't recommend dropping other Unicode encodings. I assume e.g. UTF-16 is
more popular in non-ASCII-based locales since it never requires more than 2
bytes for characters in the BMP (basic multilingual plane). I guess you could
pretend the 30% of the web that's in ASCII is UTF-8, but then you'd still be
missing all the Chinese and Japanese sites that aren't using Unicode yet.
That's almost 10% of the web as a whole and obviously if you're targeting
those markets it would be a lot higher.

~~~
mekoka
There are very few rational reasons to be using UTF-16 instead of UTF-8, and I
would be so bold as to claim that the two aren't even close in popularity for
non-ASCII text; UTF-8 is clearly well above. Regarding memory consumption,
Latin-script texts also predominantly use ASCII characters, with only the
occasional non-ASCII glyph (cedillas, accents, tremas and other such
characters). The difference between UTF-8 and UTF-16 is that the former uses 1
byte to represent characters in the ASCII range and 2 bytes for the occasional
â, whereas the latter _always_ uses at least 2 bytes, which would explain
UTF-8's popularity for Latin-based alphabets. Also, in UTF-16 and its brother
UTF-32, ordinary characters get padded with zero bytes, which is wasteful, and
both drag in extra machinery such as Byte Order Marks to deal with endianness.
Why would anyone want to deal with this stuff when writing an internationalized
program?
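
A quick Python check of those byte counts (the sample string is my own
illustration):

    s = "naïve café"                    # 10 codepoints, 2 of them outside ASCII

    print(len(s.encode("utf-8")))       # 12 bytes: 1 per ASCII char, 2 per accented char
    print(len(s.encode("utf-16-le")))   # 20 bytes: 2 per BMP character
    print(len(s.encode("utf-32-le")))   # 40 bytes: 4 per character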

~~~
garethadams
Tell that to the Japanese, whose glyphs are always 3+ bytes in UTF-8!

~~~
kijin
It doesn't really matter, since most document formats already use compression.
No matter what encoding you use, the amount of entropy is the same, so UTF-8
usually achieves a higher compression ratio than 2-byte encodings for CJK
languages. That nearly compensates for the larger uncompressed size.

Example: The Korean text of the Universal Declaration of Human Rights [1] is
8.1KB in EUC-KR and 11.2KB in UTF-8. When compressed with bzip2, it's only
3.1KB and 3.2KB, respectively. I assume Japanese would behave similarly.

[1] <http://www.ohchr.org/EN/UDHR/Pages/Language.aspx?LangID=kkn>
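
For anyone who wants to repeat the measurement, a rough Python sketch (the
filename is hypothetical; the numbers above are from the parent comment):

    import bz2

    # Hypothetical local copy of the Korean UDHR text, saved as plain text.
    with open("udhr_ko.txt", encoding="utf-8") as f:
        text = f.read()

    for enc in ("euc-kr", "utf-8"):
        raw = text.encode(enc)
        packed = bz2.compress(raw)
        print(enc, len(raw), "bytes raw,", len(packed), "bytes after bzip2")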

------
fsckin
Learned a fantastic new word from this article, thanks!

Mojibake (文字化け) (IPA: [modʑibake]; lit. "unintelligible sequence of
characters"), from the Japanese 文字 (moji) "character" + 化け (bake) "change", is
the occurrence of incorrect, unreadable characters shown when software fails
to render text correctly according to its associated character encoding.

<http://en.wikipedia.org/wiki/Mojibake>
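
It's also easy to produce your own; a one-line Python illustration (mine, not
from the article) of UTF-8 bytes decoded with the wrong single-byte charset:

    # UTF-8 bytes read back as ISO-8859-1 -> classic mojibake.
    print("öäå".encode("utf-8").decode("iso-8859-1"))   # prints: Ã¶Ã¤Ã¥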

~~~
sp332
The best part of that article: handwritten mojibake:
<https://en.wikipedia.org/wiki/File:Letter_to_Russia_with_krokozyabry.jpg>

~~~
robin_reala
I’m amazed that the postal services decoded that!

------
js2
Obligatory reference to Unicode and encoding -
<http://www.joelonsoftware.com/articles/Unicode.html>

------
joshuahedlund
That is an incredible increase in a short amount of time, especially to pass
60% with no sign yet of even a decreasing slope. The post offers no
explanation for the sudden surge, and I am too ignorant on this topic to
speculate. Was there a sudden change in the defaults of operating systems /
languages / etc? (Without details that's the only thing that seems generally
plausible to me)

~~~
mkr-hn
Most of the rise started when blogging took off. A few hundred million
WordPress and Blogger blogs would do it. The non-UTF-8 slice of the pie got
smaller.

------
stephen_g
It's good to know that it's on the increase. Still, character encoding seems
to be something that is understood by very few people... I wonder how many of
these sites are just reporting UTF-8 or something because their web server
defaults to it, and not actually encoding special characters properly?

It's certainly very easy to get wrong - I had the problem a little while back setting
up a little web app where the web server and MySQL database were both UTF-8,
but the db connection was defaulting to ISO-8859-1 or something, causing all
sorts of issues with curly quotes etc.
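
The usual fix is to declare the charset explicitly on the connection instead of
trusting the driver default. A minimal sketch, assuming a Python client with
PyMySQL (the connection details are placeholders, and the original app may well
have used a different stack):

    import pymysql

    # Force the connection itself to UTF-8 so text isn't silently transcoded
    # through latin1 on the way in or out. ("utf8mb4" is MySQL's name for full
    # UTF-8; plain "utf8" there is a 3-byte subset.)
    conn = pymysql.connect(
        host="localhost",
        user="webapp",
        password="secret",
        database="webapp",
        charset="utf8mb4",
    )

    with conn.cursor() as cur:
        cur.execute("SELECT 'curly “quotes” survive the round trip'")
        print(cur.fetchone()[0])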

~~~
obtu
> I wonder how many of these sites are just reporting UTF-8 or something
> because their web server defaults to it, and not actually encoding special
> characters properly?

I think they are measuring the encoding that is picked before adding a page to
the index. Which would be a combination of explicit information (headers, xml
and html metadata) and heuristics (which may fail, but are useful when the
explicit information is missing or — even though ignoring explicit metadata is
bad — obviously incorrect).

It also seems they are looking for a subset encoding once they have the
metadata; the posts describe explicitly labelling ASCII when the contents are
within that subset.
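
As a sketch of the general shape of such a decision (entirely my own
illustration in Python, not Google's actual pipeline):

    import re

    def guess_charset(headers, body_bytes):
        """Rough precedence: HTTP header, then HTML meta, then a crude heuristic."""
        m = re.search(r'charset=([\w-]+)', headers.get("Content-Type", ""), re.I)
        if m:
            return m.group(1).lower()

        head = body_bytes[:4096].decode("ascii", errors="ignore")
        m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', head, re.I)
        if m:
            return m.group(1).lower()

        # Fallback: label it ASCII if it stays within that subset, else assume UTF-8.
        return "ascii" if all(b < 0x80 for b in body_bytes) else "utf-8"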

But their methodology has changed from the previous two posts: "Latin above
ASCII in 2008" didn't exist in
<http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html> or
<http://googleblog.blogspot.com/2008/05/moving-to-unicode-51.html>.

~~~
justincormack
Yes, I wondered this. I would love to see figures for incorrectly labelled
pages on that chart.

------
sambeau
I have just been writing a lot of C code to support Unicode in a new project.
It reads UTF-8, converts to UTF-32, does lots of clever stuff, and spits UTF-8
out once again.

Here's my take: UTF-8 should, rightly, be the only interchange text format of
choice for the right-minded individual. The other UTFs should be internal
formats used in memory or in an on-disk cache / database / etc.

Why? Simple: UTF-8 rocks!

It's a beautifully designed, backwards-compatible, nifty piece of
back-of-a-napkin genius. Simple to code and decode (once you understand it),
simple to check with a regular expression (once you realise it is essentially
just a token with a set number of chars), and simple to add the wealth of the
world's characters to your app with a reasonable amount of code. Plus, it's
compact in the way Huffman coding is compact (at least from a Western
perspective).
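
To back that up, here is a small hand-rolled encoder (a Python sketch of my
own, not the C code from the project above) showing the whole bit layout:

    def utf8_encode(cp):
        """Encode a single codepoint by hand, to show UTF-8's bit layout."""
        if cp < 0x80:          # 1 byte:  0xxxxxxx
            return bytes([cp])
        if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

    # Sanity check against Python's built-in codec.
    for ch in ["A", "é", "日", "💩"]:
        assert utf8_encode(ord(ch)) == ch.encode("utf-8")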

Also, no-one should ever be using (char *) in the 21st century unless it is to
temporarily hold a UTF-8 string before converting to wchar.

Why? Because you almost certainly don't have a (char *), you have a UTF-8
sequence (see above). Sadly this makes your memory-mapped files slightly
redundant. But don't fret, this is the future: convert them to 32 bits and
release them. Be happy that you can now treat any character sequence in common
usage, in the whole of humanity, like an array.

As for UTF-16, why bother? It's neither compact, clever, nor big enough to
hold every character on the internet. 💩 (U+1F4A9) needs more than 16 bits and
everyone, now, needs to support a poop with eyes.

tl;dr: Share UTF-8 promiscuously, keep UTF-32 for private moments. Don't dally
with UTF-16, she's an old tease and can't handle poop.


~~~
simcop2387
UTF-16 will handle poop just fine. UTF-16 handles higher characters in a
manner much like UTF-8: pile of poo is encoded as the surrogate pair D83D DCA9
in UTF-16, the same size (four bytes) as it is in UTF-8.

There may be a distinct size advantage for some Asian scripts in using UTF-16
instead of UTF-8, as it allows encoding more of the glyphs without adding more
overhead bits. How much this saves in reality I'm not sure.
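
A quick check of both points (my own illustration):

    s = "💩"                                  # U+1F4A9, outside the BMP

    print(s.encode("utf-16-be").hex())        # d83ddca9 -> the surrogate pair, 4 bytes
    print(len(s.encode("utf-8")))             # 4 bytes in UTF-8 as well

    # For BMP CJK text, UTF-16 is 2 bytes per character vs 3 in UTF-8:
    jp = "日本語"
    print(len(jp.encode("utf-16-be")), len(jp.encode("utf-8")))   # 6 vs 9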

~~~
ruediger
I wouldn't use UTF-16 unless I had to work with a legacy system. A lot of
software claiming to handle UTF-16 is broken and really only works with UCS-2.
You have to worry about endianness and so on.

If you develop an application for the international market you should probably
go with UTF-8 as well. Maybe if you develop only for the Asian market it's a
bit different. But in my experience significant amounts of text usually come
in some form of data or markup format (HTML, XML, JSON, etc.), and the syntax
of those formats is defined in the ASCII subset. So UTF-8 still wins. Just take
a random page from the Japanese Wikipedia and encode it in UTF-8 and in UTF-16:
you'll see that UTF-8 almost always wins.
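
A tiny illustration of that markup effect (the snippet is made up):

    # Markup syntax is ASCII, so even a CJK document has many 1-byte-in-UTF-8 chars.
    doc = '<p class="note">文字化けはもう見たくない。</p>'

    print(len(doc.encode("utf-8")))      # 59 bytes: markup at 1 byte/char, kana and kanji at 3
    print(len(doc.encode("utf-16-le")))  # 66 bytes: everything in the BMP at 2 bytes/char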

------
jbarham
FWIW, UTF-8 was designed by Ken Thompson:
<http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt>

So his inventions of Unix and UTF-8 now dominate both the back- and front-ends
of the internet.

------
chalst
From the article: _When you include ASCII, nearly 80 percent of web documents
are in Unicode (UTF-8)._

...which indicates that nearly all of the web's Unicode is UTF-8.

~~~
sjwright
Can you find a website that uses a non-UTF-8 Unicode encoding? (Actually now
that I think about it, UTF-16 probably makes sense for non-Latin languages and
might well be common. Does anyone have any insights here?)

~~~
olavk
ISO-8859-n encodings were pretty common in Europe, because they were the
default charsets on Windows. I just checked the web pages of two major Danish
newspapers: they use ISO-8859-1. UTF-8 is getting more widespread though.

~~~
guard-of-terra
ISO-8859-n is not Unicode.

~~~
olavk
They are considered unicode encodings.

~~~
derleth
No, they are not.

~~~
olavk
<http://www.w3.org/TR/html4/charset.html> considers ISO-8859-n character
encodings:

"Commonly used character encodings on the Web include ISO-8859-1 (also
referred to as "Latin-1"; usable for most Western European languages),
ISO-8859-5 ..."

~~~
guard-of-terra
ISO-8859-5 has NEVER been a common encoding on the Web.

KOI8-R was, then Windows-1251. Now it's often UTF-8.

------
rabidsnail
All that's saying is that software developers have discovered that there are
alphabets other than the Roman one, and people use them. But all it takes to be
counted as UTF-8 is to have your web server set the charset in the Content-Type
header, or to add a meta charset declaration to your page. If you stick to
plain 7-bit ASCII characters, UTF-8 and ASCII are byte-for-byte identical.

What I really want to know is the breakdown by alphabet. How many sites are in
Cyrillic? Kanji? Arabic? Thai?

------
Sami_Lehtinen
As a Finnish user, I have found that using UTF-8 as the default charset (in
the browser) is not a practical option. There are so many sites out there using
ISO-8859-1 which do not inform the browser about their charset choice at all.
So when I'm using UTF-8 as the default, I get pages with mangled Scandinavian
characters all the time. öäåÖÄÅ

~~~
Sami_Lehtinen
One example: even the people at <http://cert.fi> don't know how to provide
correct character encoding data: <http://bayimg.com/EaMOjaADi>

------
lexx
I thought only dinosaurs used something different than utf-8.

------
its_so_on
for a moment, I parsed the headline as "You need to scan 60% of the web to
find an instance of each and every unicode glyph (being used organically)..."

Actually, outside of listings of all Unicode characters/code pages, I'm 98%
sure there is a vast quantity of Unicode characters that are not used even once
(organically) on the entire Internet.

I'd bet even money you can craft a two-character 'word' such that you would be
the top Google result for it if you use it just once or twice in any context
on any page Google indexes, just because you're the first person to use those
characters organically, to say nothing of together. /s

------
Muzza
> As you can see, Unicode has experienced an 800 percent increase in “market
> share” since 2006.

> The more documents that are in Unicode, the less likely you will see mangled
> characters (what Japanese call mojibake) when you’re surfing the web.

Hmm. Well, I know that's true in theory, but my personal experience is that
the occurrence of mojibake has increased lately. I even wrote about it here on
HN: <http://news.ycombinator.com/item?id=2075010>

~~~
lmm
I think a lot of new code tends to just assume UTF-8, and if you try and use
something else that's your problem (I know I do this). Older code had to be
able to detect encodings.

------
NanoWar
Unicode is the dollar of the internet?

------
olavk
Actually, _the whole web_ is Unicode today, AFAICT. Unicode is defined as the
character repertoire of HTML and XML. In olden days we had different charsets
(like ASCII, ISO-8859-n and so on), but these have simply been redefined in an
HTML/XML context to be character _encodings_. So ASCII, ISO-8859-n etc. are
considered Unicode encodings which happen to be able to represent only a
subset of the full Unicode character repertoire.

~~~
derleth
Absolutely none of that is correct.

~~~
olavk
A lot of people seem to agree with you, but I believe the HTML 4 spec supports
my point: <http://www.w3.org/TR/html4/charset.html>

HTML is defined to use Unicode as the document character set. But the
characters can be represented as byte streams using different encodings, UTF-8
being one encoding, ISO-8859-1 being another.

> The "charset" parameter identifies a character encoding, which is a method
> of converting a sequence of bytes into a sequence of characters.

A lot of people seem to confuse Unicode with the UTF-encodings.

~~~
derleth
> I believe the HTML 4 spec supports my point

That document is badly-written; a more reasonable way to interpret it is to
conclude that user agents will use some form of Unicode internally, after
converting whatever character encoding the document they received used. Which
is, indeed, a very reasonable way to design your software, but it doesn't make
Latin-1 (for example) a Unicode encoding by any reasonable standard.

> A lot of people seem to confuse Unicode with the UTF-encodings.

True. I do not.

