
Unicode is, in 2012, still “cutting edge.” - masklinn
http://golem.ph.utexas.edu/~distler/blog/archives/002539.html
======
timr
Yeah, well, Unicode is hard. If it looks simple, it's only because you haven't
looked carefully enough:

<http://www.unicode.org/versions/Unicode6.1.0/>

The latest _published_ version of the standard (5.0) runs to _almost 1500
pages:_

<http://www.unicode.org/book/aboutbook.html>

(Best quote from that page: _“Hard copy versions of the Unicode Standard have
been among the most crucial and most heavily used reference books in my
personal library for years.”_ -- Donald Knuth)

People think that Unicode support is just a matter of implementing multi-byte
characters, but it's so much more: you've got collation rules, ligatures,
rendering, line-breaking, punctuation, reading direction, and so on. Any
technical standard that aims to cover all known human languages is going to be
a little bit complex.

~~~
koenigdavidmj
> _People think that Unicode support is just a matter of implementing multi-
> byte characters, but it's so much more: you've got collation rules,
> ligatures, rendering, line-breaking, punctuation, reading direction, and so
> on. Any technical standard that aims to cover all known human languages is
> going to be a little bit complex._

Half of those are irrelevant to MySQL or any other database. Those are front
end problems. Even reading punctuation, text direction, and the like will only
be important in more advanced collation orders (as opposed to just binary
ordering, which he was using).

~~~
timr
It's true that glyph rendering won't matter to a database store, and that
binary collation is "good enough" for most people (but then again, "good
enough" is how you get to a UTF-8 implementation that doesn't support 4-byte
characters). That said, it's also true that Unicode characters outside of the
BMP are still pretty exotic/specialized:

<http://stackoverflow.com/questions/5567249/what-are-the-most-common-non-bmp-unicode-characters-in-actual-use>

Is it "excusable" that MySQL's implementation of UTF-8 isn't standard? That's a
judgment call (they _are_ up-front about it in the docs). But given that most
unicode characters "in the wild" lie in the BMP, I can see how they'd make
that trade-off. There might well be a technical limitation lurking somewhere
in the database internals that made 4-byte characters a problem.

~~~
sedev
No, it's not excusable - the MySQL project had a trivial alternative: don't
call it UTF8!

The options are simple: implement UTF-8 correctly and call your implementation
UTF-8, or implement just the BMP and name it accordingly. They did neither:
effectively, they _lied_ to end-users. That's deeply, deeply problematic.

~~~
timr
_"effectively, they lied to end-users"_

Yeah, that's not exaggeration at all. Because it's not as if they _very
clearly document_ exactly what they support:

<http://dev.mysql.com/doc/refman/5.1/en/charset-unicode.html>

~~~
pepve
Sure, so let's call this button that erases your data "list files", and just
_very clearly document_ it. It's not lying if we change the definition.

~~~
deafbybeheading
Well, to be fair, MySQL has a storied history of implementing 95% of a
feature, calling it good enough, and shipping it.

And while, as a Postgres user, my tone here may be a little snide, I also say
this with grudging respect: I think there is a point at which implementing n%
of a feature X and calling it X (rather than MaybeX or MostlyX) does give you
some momentum and practical compatibility that you wouldn't have otherwise. Is
it dishonest to hide the limitations regarding the edge cases in some
documentation no one will read? Maybe. But will providing the feature solve
more problems than it causes? Quite possibly.

I don't agree with MySQL's decision with respect to UTF-8, but I do understand
it.

~~~
sedev
That's an important piece of context, thank you for pointing it out.
Engineering decisions occur in a cultural context of mere humans making
decisions, and we do well to remember that.

------
VMG
> MySQL’s utf8 encoding only covers the BMP. It can’t handle 4-byte characters
> at all.

Wow. That is pathetic.

~~~
excuse-me
No, it's engineering. Here are two versions of the software:

1\. This one is faster and better tested, and string handling (it's a
database!) is much faster, but it only handles the ~65,000 most common
characters.

2\. This one can handle upside-down characters from a 1930s paper on formal
logic in Turkish, but it's slower in all other cases and we haven't really
tested it as much.

Do you have a redundant, self-powered, asteroid-impact-proof internet
connection? No? Pathetic!

~~~
pindi
Sure, in some cases it makes sense to make the tradeoff of not handling more
obscure characters. But if the tradeoff is made, the encoding should not be
called UTF-8.

"UTF-8 (UCS Transformation Format—8-bit[1]) is a variable-width encoding that
can represent every character in the Unicode character set," says Wikipedia.
The UTF-8 implementation in MySQL does not meet this definition because it
cannot represent every character in the Unicode character set.
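
To make the gap concrete, here's a quick Python sketch (the example characters
are chosen arbitrarily): every BMP character fits in at most 3 bytes of UTF-8,
while anything outside the BMP needs 4 bytes, which is exactly what MySQL's
3-byte `utf8` can't store.

```python
# BMP characters need at most three bytes in UTF-8; supplementary-plane
# characters need four, which MySQL's "utf8" type rejects.
bmp_char = "\u20ac"         # U+20AC EURO SIGN, inside the BMP
astral_char = "\U0001d11e"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

print(bmp_char.encode("utf-8"))     # b'\xe2\x82\xac' (3 bytes)
print(astral_char.encode("utf-8"))  # b'\xf0\x9d\x84\x9e' (4 bytes)
```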

~~~
wmf
When MySQL first implemented UTF-8 they probably _did_ support every Unicode
character... because there were less than 64K Unicode characters. Then
Unicode/UTF-8 was redefined out from under them.

~~~
quink
> there were less than 64K Unicode characters. Then Unicode/UTF-8 was
> redefined out from under them.

Unicode 2.0 introduced multiple planes, i.e. more than 65,536 characters. That
was in 1996. MySQL has therefore had more than a decade and a half to support
the additional planes, and seems to have done so less than a year ago. I don't
buy 'redefined out from under them' when the planes were defined a year after
MySQL started, at a time when it probably didn't even have Unicode support yet
anyway.

~~~
excuse-me
But when did people really start using more than the 16bit unicode chars?

~~~
quink
> But when did people really start using more than the 16bit unicode chars?

1996.

China even made it a legal requirement for computer systems in 2000, through
mandating GB 18030.

There's the Private Use Area if nothing else. There is NO excuse for
supporting only the BMP. Adding support is trivial unless you have been using
UTF-16 in the erroneous belief that it's always two bytes per character (in
which case you've really been using UCS-2).

------
wmf
It looks like we're still seeing fallout from the "16 bits are all you need"
thing. Maybe telling people that they'd never need to worry about this stuff
after adopting (BMP subset of) UTF-8 wasn't a great idea.

~~~
derleth
> (BMP subset of) UTF-8

The hell of it is, UTF-8 expands _gracefully_ to the astral planes; it's
UTF-16 that you need to worry about: either the people designing the software
never heard of surrogate pairs, in which case they gave you UCS-2 rather than
UTF-16, or they implemented surrogate pairs incorrectly.
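
A small Python illustration of that asymmetry (the emoji is just an arbitrary
non-BMP example): UTF-8 encodes an astral character with the same
variable-length scheme it uses for everything else, while UTF-16 has to split
it into a surrogate pair, which is exactly where naive UCS-2-minded code
breaks.

```python
ch = "\U0001f600"  # U+1F600, outside the BMP

# UTF-8: same algorithm as always, just a four-byte sequence.
print(ch.encode("utf-8"))    # b'\xf0\x9f\x98\x80'

# UTF-16: the code point must be split into a high/low surrogate pair.
utf16 = ch.encode("utf-16-be")
print(utf16.hex())           # 'd83dde00'

# Code that assumes one character per 16-bit unit (i.e. UCS-2) miscounts:
print(len(utf16) // 2)       # 2 code units for 1 character
```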

------
peteretep
This particular limitation of MySQL hit me HARD HARD HARD circa 2008 when I
tried to use some of that upside down text you find online as test data, and
just couldn't work out why I was getting data corruption.

Luckily Perl's Unicode support is fantastic, and saved my ass

~~~
postfuturist
How did you solve MySQL's unicode limitations with Perl?

~~~
lmm
At my previous^2 company we worked around MySQL's limitations by just storing
our data in VARBINARY columns, encoded as utf8 on the client-side. Worked like
a charm. (I hate MySQL)
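
A rough sketch of that workaround in Python (the table/column names and the
driver call at the end are invented for illustration): since the column is
VARBINARY, MySQL never interprets the bytes as characters, and the application
does all the UTF-8 encoding/decoding at the boundary.

```python
# Client-side UTF-8 over a VARBINARY column: MySQL stores opaque bytes,
# so the full Unicode range (including 4-byte characters) survives.

def to_db(text: str) -> bytes:
    return text.encode("utf-8")    # encode on the way in

def from_db(blob: bytes) -> str:
    return blob.decode("utf-8")    # decode on the way out

blob = to_db("G clef: \U0001d11e")   # non-BMP character, 4 bytes in UTF-8
assert from_db(blob) == "G clef: \U0001d11e"
# e.g. cursor.execute("INSERT INTO notes (body) VALUES (%s)", (blob,))
```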

------
joelhaasnoot
Unicode is just plain hard to do right...

I worked for a couple of months on a product geared at minority languages in
developing countries - doing linguistics work etc. It was a pain to support
Unicode: there are lots of code points and lots of weird cases (what counts as
a capital?), and while ICU is a good library, it isn't up to date with the
latest Unicode version, and it's C/C++ (and thus a pain to use from C#). Oh,
and there's the Private Use Area, where characters go while Unicode decides
whether or not to include them...

------
antihero
If we're using PostgreSQL is everything going to be pretty much dandy?

~~~
ocharles
As long as your clusters use the right locale/encoding settings, yes.

~~~
justincormack
Only use a locale that does unicode collation if you need to, as it is a big
performance hit.

------
PaulHoule
It's almost scandalous how poorly Unicode has been implemented in the most
popular development platforms.

Java, for instance, implemented a 2-byte encoding and uses surrogates for the
higher planes, which means you get the worst of both worlds... You double the
size of ASCII text (that is, halve the speed of bandwidth-limited operations
on text) and you've still got a variable-length encoding... but you've got
lots of methods and user-written code that assume the text is fixed-length
encoded. What a mess.

------
a2tech
That's because Unicode is STILL a pain to use. I read the articles that come
along about Unicode and still don't understand why handling it is so
impossible. Until it's transparent for a programmer to use, it won't be as
widely used as it should be. My apps (and I'm ashamed to admit it) aren't
Unicode friendly. But it's currently too much work for too little reward to go
through and make all that code Unicode friendly.

~~~
henrikschroder
> That's because Unicode is STILL a pain to use.

I think that depends heavily on what programming language and frameworks you
are using.

~~~
jpdoctor
Pointers to a comparison across languages would be welcomed. TIA.

~~~
LoonyPandora
<http://training.perl.com/OSCON2011/index.html>

Specifically the talk titled "Unicode Support Shootout: The Good, The Bad, &
the (mostly) Ugly"

It's a year old now, but it's still relevant. It gives a very detailed look at
unicode support across JavaScript, PHP, Go, Ruby, Python, Java, and Perl.

------
dunham
Apple's TextEdit can't open files with these characters in them. It reports:
"The document “test.txt” could not be opened. Text encoding Unicode (UTF-8)
isn't applicable."

~~~
lambda
Works fine for me. What version of Mac OS X/TextEdit are you using? Are you
sure you are saving it (and opening it) as UTF-8?

~~~
dunham
Sorry, my mistake. I was saving a file from TextMate with the string "𝖙𝖊𝖘𝖙" in
it and then opening it with "open -a TextEdit eg.txt".

The same experiment with cat in place of TextMate works fine, so it's TextMate
that is buggy.

According to "od -t x1", TextMate is writing:

    0000000    ed  a0  b5  ed  b6  99  ed  a0  b5  ed  b6  8a  ed  a0  b5  ed
    0000020    b6  98  ed  a0  b5  ed  b6  99

So TextEdit is right to complain.

~~~
lambda
Yeah, looks like TextMate is simply running the UTF-8 algorithm over UTF-16
code units, so each surrogate is being encoded as its own three-byte UTF-8
sequence (which decodes to an invalid character).

It turns out that this is such a common mistake that there's even a name for
this encoding, CESU-8: <http://en.wikipedia.org/wiki/CESU-8>
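
The bug is easy to reproduce. Here's a Python sketch of the broken algorithm
(my reconstruction, not TextMate's actual code): UTF-8-encode each UTF-16 code
unit, surrogates included, instead of each code point. The output matches
dunham's od dump byte for byte.

```python
def cesu8(s: str) -> bytes:
    """UTF-8 run over UTF-16 code units instead of code points (CESU-8)."""
    out = bytearray()
    for ch in s:
        cp = ord(ch)
        if cp > 0xFFFF:
            # split into a UTF-16 surrogate pair, then UTF-8-encode each
            # surrogate as though it were a real character
            cp -= 0x10000
            for unit in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8")
    return bytes(out)

# the 24 bytes from the od dump above
dump = bytes.fromhex("eda0b5edb699eda0b5edb68aeda0b5edb698eda0b5edb699")
assert cesu8("\U0001d599\U0001d58a\U0001d598\U0001d599") == dump  # "𝖙𝖊𝖘𝖙"
```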

------
chadrs
You don't even want to know the hack I had to implement to get non-BMP
characters stored in a MySQL 5.1 'utf8' column.

~~~
pjscott
That's exactly why I use the default ISO-8859-1 encoding for all my MySQL
tables, try to only stick ASCII into it, and store any Unicode text as binary
UTF-8, encoded client-side. It's stupid and shouldn't be necessary, but at
least MySQL can't screw it up.

------
tezza
Man, I still find myself reaching for

    dos2unix

all the time. How sad.

------
webreac
These 16-bit, UCS-2, wide-char ideas are just plain wrong! Just use UTF-8 for
files and for communication, and 32-bit code points internally when needed.

------
krollew
Yeah, so what? UTF-8 is the Linux standard. I use it, and probably most of the
web uses it (where needed). Why don't I see problems with it in the software I
use and on the web? I can't remember the last time I had a problem with UTF-8.

------
mrj
This is a MySQL limitation that they have fixed in recent releases (as the OP
notes). It's not fair to blame Unicode at large for MySQL problems.

However, it's true that Unicode is (relatively speaking) very new for such a
fundamental technology. Support in applications still varies widely. I
wouldn't characterize it as cutting edge though, since we have many mainstream
programming languages built using Unicode internally.

~~~
masklinn
> This is a MySQL limitation that they have fixed in recent releases (as the
> OP notes)

TFA notes that this is _not_ fixed: the `utf8` MySQL encoding still isn't
UTF-8. And as TFA _also_ notes, related technologies (i.e. drivers) may not be
compatible with the fix (in the example he uses, mysql2 for Ruby _still_
hasn't had an official release supporting utf8mb4[0])

> it's true that Unicode is (relatively speaking) very new for such a
> fundamental technology

That's becoming quite a hard argument to swallow when people are encountering
astral-plane issues in 2012, given that Unicode 2.0 was introduced back in 1996.

[0] <https://github.com/brianmario/mysql2/issues/249>

~~~
mrj
> > it's true that Unicode is (relatively speaking) very new for such a
> > fundamental technology

> That's becoming quite hard an argument to swallow when encountering astral
> planes issues in 2012 when Unicode 2.0 was introduced in 1996.

I don't get your argument. MySQL was also released around that time and we
don't call it "cutting edge" because we found a bug. There are bugs in old
stuff all the time but (most) people don't throw a fit.

~~~
pyre
Unicode is being called 'cutting edge' because it's no longer 'old hat.' Lots
of things claim support for Unicode, but few (or none) support it well.
Unicode isn't a software project, it's a spec/idea. It's like calling a Star
Trek tricorder "cutting edge" because no one has implemented a fully-
functional version: sure, the idea has been around for a while, but at this
point there are no acceptable manifestations of it.

