
Unicode support in MySQL is... - codeka
http://www.codeka.com.au/blog/2014/02/unicode-support-in-mysql-is--
======
ggreer
_Now presumably there is some drawback from doing this, otherwise "utf8mb4"
would be the default (right?) but I'll be damned if I can figure out what the
drawback is._

There are a couple, and they are insidious. From
[http://geoff.greer.fm/2012/08/12/character-encoding-bugs-are-%F0%9D%92%9Cwesome/](http://geoff.greer.fm/2012/08/12/character-encoding-bugs-are-%F0%9D%92%9Cwesome/):

 _This is when I discovered that InnoDB limits index columns to 767 bytes. Why
is this suddenly an issue? Because changing the charset also changes the
number of bytes needed to store a given string. With MySQL’s utf8 charset,
each character could use up to 3 bytes. With utf8mb4, that goes up to 4 bytes.
If you have an index on a 255 character column, that would be 765 bytes with
utf8. Just under the limit. Switching to utf8mb4 increases that index column
to 1020 bytes (4 * 255)._

And later:

 _You might say, “OK, now we’re finished, right?” Ehh… it’s not so simple.
MySQL’s utf8 charset is case-insensitive. The utf8mb4 charset is case-
sensitive. The implications are vast. This change in constraints forces you to
sanitize the data currently in your database, then make sure you don’t insert
anything with the wrong casing. At work, there was much weeping and gnashing
of teeth._

Good luck. The real solution is to switch to something that isn't MySQL. The
project in that post ended up switching to Cassandra. For new SQL stuff, I now
use PostgreSQL.
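
To make the quoted index limit concrete, here's a minimal sketch (hypothetical
table; the exact error depends on MySQL version and InnoDB settings):

    -- 255 chars * 4 bytes = 1020 bytes, over InnoDB's 767-byte key limit,
    -- so this typically fails with "Specified key was too long":
    CREATE TABLE users (
      email VARCHAR(255) CHARACTER SET utf8mb4,
      KEY (email)
    );

    -- A common workaround is indexing only a 191-character prefix
    -- (191 * 4 = 764 bytes, just under the limit):
    CREATE TABLE users (
      email VARCHAR(255) CHARACTER SET utf8mb4,
      KEY (email(191))
    );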

Edit:

To those of you who point out that utf8mb4_unicode_ci is case-insensitive, we
did try that. This whole fiasco happened two years ago, and I don't remember
what the show-stopper was, but we decided not to use that collation. Whatever
issues we had with it were bad enough to warrant switching to utf8mb4_bin and
sanitizing everything. Really, the solution is to switch away from MySQL and
forget about collations and encodings.

~~~
vog
_> For new SQL stuff, I now use PostgreSQL._

That's so true! Years ago I switched almost all projects to PostgreSQL. I
never looked back. I use MySQL only for very legacy applications.

MySQL has so many bugs and inconsistencies that Unicode support is only one of
many issues. (Unsafe GROUP BY, silently truncating concatenated text fields
when the result grows too big, silently converting invalid dates to
'0000-00-00', and so on ...)
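
For instance, the unsafe GROUP BY looks roughly like this (hypothetical orders
table): MySQL happily runs it, picking an arbitrary order_date per group,
while PostgreSQL rejects the query outright (MySQL only complains if the
ONLY_FULL_GROUP_BY mode is enabled):

    SELECT customer_id, order_date, SUM(total)
    FROM orders
    GROUP BY customer_id;  -- order_date is neither grouped nor aggregated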

In the past, MySQL had some performance advantages over PostgreSQL, but only
for MyISAM, i.e. without real transaction safety. Nowadays, if you want to
trade ACID for speed, you wouldn't use MySQL but some "NoSQL" database
instead. So even from a performance point of view, I see no point in using
MySQL for anything but legacy stuff.

~~~
contingencies
I switched to _sqlite_. I've done my time on SQL servers. In my view, they
simply have too much complexity for many scenarios. Basically, one big
database server equals one big downtime when issues occur. Sharded data
architecture on the other hand can design for failure and maintain graceful
degraded performance, not to mention optimized resource allocation and the
option of adopting vastly different security and backup policies across
different parts of the datastore (eg. encryption, snapshots, etc.). Yes, you
lose database-internal consistency guarantees. No, that doesn't mean you have
to lose consistency.

~~~
geon
I did that in 2006. A big win was maintenance, since downloading/uploading a
copy of the database was so simple. Eventually I had performance problems when
doing read-write-read-write stuff, since writing locks the entire db file.

You could argue I shouldn't be doing that, and you may be right. But I just
switched back to MySQL, and the problem went away.

~~~
rogerbinns
SQLite fixed that issue in 2010 with write ahead logging (version 3.7). Writes
go to a separate file so there is no impact on readers of the main file. It is
a little more complex than that, and is described at
[https://www.sqlite.org/wal.html](https://www.sqlite.org/wal.html)
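
Enabling it is a one-liner; a sketch (the setting is persistent, i.e. it
sticks to the database file across connections):

    PRAGMA journal_mode=WAL;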

------
pilif
Another nice "feature" of MySQL: if you have a table configured as UTF-8, and
you use UTF-8 client encoding, but then accidentally send something that's not
UTF-8 (shit happens), then MySQL will happily accept your data and truncate it
at the first byte it doesn't understand.

It issues a warning, but most frameworks don't care to inform the caller if a
warning has happened (some don't even provide a way to access them).

Yes. You should not be sending invalid data to your database, but holy sh*t,
your database shouldn't (mostly) silently alter the data you entrust it with.

If the input data is wrong and you can't deal with it, blow up in the face of
the user. Don't try to "fix" it by corrupting it.
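
For what it's worth, you can opt into stricter behavior per session, so that
bad input becomes an error rather than a warning; a sketch, assuming a
reasonably recent MySQL:

    -- With strict mode on, an invalid byte sequence fails with
    -- "Incorrect string value" instead of being silently truncated:
    SET SESSION sql_mode = 'STRICT_ALL_TABLES';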

Disclaimer: This happened to me in 2008
([http://pilif.github.io/2008/02/failing-silently-is-bad/](http://pilif.github.io/2008/02/failing-silently-is-bad/)), so it might
have been fixed since, but stuff like this made me lose my trust in MySQL long
ago.

~~~
tomvangoethem
Besides causing annoyances, the truncation of data may also have security
implications. Last year, I found that WordPress (which uses the utf8 character
set by default) was vulnerable to PHP object injection, because its filter
could be bypassed by (ab)using the truncation of strings in MySQL. For more
details: [http://vagosec.org/2013/09/wordpress-php-object-injection/](http://vagosec.org/2013/09/wordpress-php-object-injection/)

------
kimmel
I recently moved an application off MySQL to PostgreSQL after hitting this
error. After over a decade of MySQL doing things their way, and it always
being a pain in the ass for everyone else, when is enough enough?

I no longer develop software to run on mysql because I like my sanity.

~~~
rimantas
Interesting; pretty much every system has insane things in it. Is PostgreSQL
an exception?

~~~
sitharus
Certainly not! However, its WTFs-per-minute count is far lower than MySQL's.

I started off my career as a PHP/MySQL developer. PostgreSQL is definitely
years ahead of MySQL. It used to be that MySQL had far superior read
performance for simple queries, but that hasn't been true since around 2006.
Also Postgres' SQL optimiser is really good.

~~~
rimantas

      > PostgreSQL is definitely years ahead of MySQL
    

I guess this does not apply to replication, upserts, etc. Well, I think I will
just grab some popcorn and wait. It will be fun to watch how those who never
bothered to learn MySQL (and by this I don't only mean its peculiarities) will
get bitten for not bothering to learn PostgreSQL.

~~~
gbog
Why? Because you think everything is equal, everything has its own gotchas and
not one is better than the other?

That is against all history, not only in software. Postgres is objectively
better (safer, better engineered, easier to use, etc.) than MySQL, just as
Python is objectively better than PHP, and an AK-47 was better than most other
assault rifles of its time.

~~~
the_mitsuhiko
The lack of upserts is a problem in postgres, as much as some people say they
can get away without them. There are objective concurrency issues with the
lack of it, that have been acknowledged by postgres developers when they
started work on upserts.
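
To illustrate: MySQL has had a native upsert for ages, while the portable
emulation is inherently racy. A sketch with a hypothetical counters table
(assuming a unique key on name):

    -- MySQL: atomic upsert
    INSERT INTO counters (name, hits) VALUES ('home', 1)
      ON DUPLICATE KEY UPDATE hits = hits + 1;

    -- Naive emulation in PostgreSQL (pre-upsert): two concurrent sessions
    -- can both miss the UPDATE and then collide on the INSERT.
    UPDATE counters SET hits = hits + 1 WHERE name = 'home';
    INSERT INTO counters (name, hits)
    SELECT 'home', 1
    WHERE NOT EXISTS (SELECT 1 FROM counters WHERE name = 'home');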

~~~
einhverfr
> There are objective concurrency issues with the lack of it, that have been
> acknowledged by postgres developers when they started work on upserts.

These concurrency issues don't fully go away by adding the feature, which is
why it is taking time to work out the desired behavior there.

------
nknighthb
> _Why they call this encoding "UTF-8" is beyond me, it most definitely is not
> UTF-8._

This is a confusion of concepts. That an encoding can encode a particular
value does not mean all applications using that encoding will support that
value. You can store any size number you like into a JSON number, but if you
send me a 256-bit value in a field for temperature in celsius, I'm going to
give you back an error.

Encodings ease data exchange; they do not alter the requirements or
limitations of the applications using them. MySQL's original Unicode
implementation supported only the Basic Multilingual Plane, so unsurprisingly,
characters outside that plane are rejected. That would be the case regardless
of the particular encoding used.

What you want is support for all Unicode characters. This is a reasonable
request, but it is decoupled from the encoding used to communicate those
characters to MySQL.

~~~
chris_wot
Say what? UTF-8 allows for any character that the Unicode standard defines. If
they wanted to restrict a character set to the BMP, then they should call it
UTF-8-BMP.

------
josteink
That people still _use_ MySQL for anything new is something I find completely
amazing.

But then again, masochism is a reasonably common, if not normal, thing.

~~~
bhaak
IMO this is a combination of "doesn't know any better", "works for me", and
"it's easy enough, don't bother me with complicated solutions".

Which is unfortunate, as the "complicated" solution is often just the right
solution; but if the wrong solution doesn't blow up in their faces in 90% to
99% of the use cases, many people are satisfied enough with it.

I think these reasons also play a role in PHP's ongoing popularity.

------
jhh
Concerning this issue, please also consider Mathias Bynens's article from 2012:
[http://mathiasbynens.be/notes/mysql-utf8mb4](http://mathiasbynens.be/notes/mysql-utf8mb4)

It has a good treatment of how to migrate encoding-wise which includes
additional precautions not mentioned in the originally linked article.
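
The core of it is converting the database, each table, and the connection
charset; a sketch along those lines (database and table names are
placeholders):

    ALTER DATABASE mydb CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;
    ALTER TABLE mytable CONVERT TO CHARACTER SET utf8mb4
      COLLATE utf8mb4_unicode_ci;
    -- And make sure clients talk utf8mb4 too:
    SET NAMES utf8mb4;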

------
lmm
Also, if you set a column charset to "Latin1", guess what encoding MySQL
actually gives you. If you guessed "like windows cp1252, but with 8 random
characters swapped around", have a cookie.

(Though I have to admit, based on the title I was hoping for a deeper problem.
MySQL has plenty of stupid gotchas like this, but it's (marginally) easier to
deal with them than to cluster postgres)

------
marlin
I also hit this issue when saving Wikipedia pages to a database, since some
languages use these 4-byte UTF-8 characters.

AFAICR, there are performance reasons for this: Western text does not "need"
this representation, and thus MySQL's utf8 handling could be faster.

I was also shocked to learn this while using MySQL 5.3 (where utf8mb4 is
missing), which led me to upgrade to the 5.5 alpha.

From [https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html](https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html)

"The character set named utf8 uses a maximum of three bytes per character and
contains only BMP characters.

As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per
character and supports supplementary characters"

PS. I have migrated to MariaDB. Take this chance to move away from Oracle
MySQL!

------
girvo
Does anyone know if MariaDB fixes this issue by default?

~~~
ars
It can't. Because of index sizes it would mean incompatibility.

But it's not really all that much of an issue - just use the mb4 character
encoding.

The hard part is knowing you need to :)
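
For new tables it's just a matter of spelling it out; a minimal sketch:

    CREATE TABLE messages (
      id   INT PRIMARY KEY,
      body TEXT
    ) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;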

------
NKCSS
I can recommend reading the comment from Alex on there as well; it has some
good info on UTF-8:

[quote] 😸 works just fine in PostgreSQL by default (Python too).

The UTF-8 issue is pretty complex, so it's not really a surprise that
implementations (like MySQL's) are incomplete. The basic encoding from the
code points into UTF-8 bytes could handle up to 30 bits of data: a first byte
of 0xFC indicates 5 following bytes with 6 bits of the code point each (7-byte
sequences for 36-bit code points are only impossible because UTF-8 bans 0xFF
and 0xFE as byte values other than for the Byte Order Mark). Yet various
standards attempt
to restrict the range, currently with RFC 3629 putting the ceiling at 0x10FFFF
(4 bytes per code point). That RFC doesn't bother to justify the constraint,
other than as a backwards compatibility band-aid with the more limited UTF-16.
The point isn't to rag on UTF-16, which made sense once, but to express
sympathy for the various attempts made to avoid coping with the full, 30-bit
range the underlying encoding can actually handle.

Not only is there a wash of hodgepodgery on the range, but UTF-8 (and other
similar encodings) can put small values in large byte encodings. There's
nothing stopping someone from just using a 6-byte long, 30-bit codepoint block
of RAM for each character, even just ASCII (obviously this violates a bunch of
RFCs, but the coding system _does_ provide for it). The result would confuse
many UTF-8 parsers (and be blocked by the more complete ones, a Pyrrhic
victory), since ASCII characters are expected to be exactly one byte long in
UTF-8, not _six_. Even within the four-byte limit of RFC-era UTF-8, the raw
coding scheme can represent an ASCII codepoint value in four different ways.
Example: an "a" (ASCII 97, hex value 0x61) can be encoded as any of:

    01100001 - 0x61 - conventional ASCII
    11000001 10100001 - 0xC1 0xA1 - two-byte UTF-8
    11100000 10000001 10100001 - 0xE0 0x81 0xA1 - three-byte UTF-8
    11110000 10000000 10000001 10100001 - 0xF0 0x80 0x81 0xA1 - four-byte UTF-8

The latter three overlong encodings aren't considered canonical, and software
devs have to fix them to make string matching efficient and so forth, to save
space, and most worrisomely to prevent attackers from sliding special
characters into strings to crack systems - say by using larger, noncanonical
encodings to evade filters that would catch and block the canonical, shorter
encodings.

Some developers would try to limit UTF-8 characters to four bytes because
that's a power of two, and fits comfortably into a 32-bit long int.

UTF-16 also complicates UTF-8 with the banning of codepoints between 0xD800
and 0xDFFF in UTF-8, used as "surrogate pairs" in UTF-16 (see
[http://www.unicode.org/version..](http://www.unicode.org/version..). for an
update that mentions this explicitly).

Anyway, the general idea is that I can sympathize with MySQL having incomplete
UTF-8 support, most implementations _are_ incomplete in some way, and one
could argue the RFC's variant is pointlessly incomplete itself (no 24 and 30
bit code points, though apparently UNICODE's Annex D _does_ allow the use of
those two larger sizes for characters outside of the UNICODE range, perhaps
widely interpreted as a constraint of UTF-8 itself instead of being about the
UNICODE subrange of UTF-8). Implementations vary enormously, and with good
reason (see
[http://www.unicode.org/L2/L200..](http://www.unicode.org/L2/L200..). for some
of them).

Fortunately, MySQL's limitation doesn't trouble me, since I do almost all my
work in PostgreSQL or some non-SQL database or another anyway. ;-P [/quote]

~~~
threedaymonk
I don't know; to me it reads like someone trying to show off, and I don't
think it's helpful.

> The latter three overlong encodings aren't considered canonical, and
> software devs have to fix them to make string matching efficient and so
> forth, to save space, and most worrisomely to prevent attackers from sliding
> special characters into strings to crack systems - say by using larger,
> noncanonical encodings to evade filters that would catch and block the
> canonical, shorter encodings.

This is misleading. "Software devs" don't have to "fix them". Overlong
encodings in UTF-8 are invalid since Unicode 3.1, precisely because of the
security considerations, and conformant implementations are required "not to
interpret any ill-formed code unit subsequences" (Unicode Standard section
3.9). RFC 3629 also states that "[i]mplementations of the decoding algorithm
above MUST protect against decoding invalid sequences."

Fortunately, it's pretty straightforward to achieve this, and the Unicode
standard contains a nine-row table listing all valid combinations of ranges of
one to four bytes.
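
For reference, that table boils down to the following ranges (paraphrasing
Table 3-7 of the standard); anything else (overlong forms, surrogates, values
above U+10FFFF) is ill-formed by construction:

    Code points          Byte 1   Byte 2   Byte 3   Byte 4
    U+0000..U+007F       00..7F
    U+0080..U+07FF       C2..DF   80..BF
    U+0800..U+0FFF       E0       A0..BF   80..BF
    U+1000..U+CFFF       E1..EC   80..BF   80..BF
    U+D000..U+D7FF       ED       80..9F   80..BF
    U+E000..U+FFFF       EE..EF   80..BF   80..BF
    U+10000..U+3FFFF     F0       90..BF   80..BF   80..BF
    U+40000..U+FFFFF     F1..F3   80..BF   80..BF   80..BF
    U+100000..U+10FFFF   F4       80..8F   80..BF   80..BF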

------
j4mie
It'd be really interesting to hear from an engineer at one of the big MySQL
users (I'm thinking Twitter, Facebook, GitHub) how they solve this problem.

------
peteretep
This bit me pretty hard. I was trying to debug other Unicode issues an
application had, and in my infinite and enduring wisdom decided to use one of
those "turn text upside down" sites to generate sample data. Turns out they
use a bunch of very high codepoints (Mormon text or something) and MySQL
choked on them. Took me ages to figure out that was the issue.

------
mariuz
UTF-8 is handled correctly by Firebird and FlameRobin. You can create a new db
with utf-8 charset as default, and then you can insert/select Unicode
codepoint U+1F638, a perfectly valid emoji character.

[http://flamerobin.blogspot.ro/2014/02/utf-8-is-handled-correctly-by-firebird.html](http://flamerobin.blogspot.ro/2014/02/utf-8-is-handled-correctly-by-firebird.html)
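
If I remember Firebird's syntax right, it's roughly this (a sketch; the file
name is a placeholder):

    CREATE DATABASE 'test.fdb' DEFAULT CHARACTER SET UTF8;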

------
codexon
Does anyone else just see squares instead of whatever Unicode symbol it is
supposed to be? I'm using Chrome 33.0.1750.117 m.

~~~
ggreer
It's an emoji thumbs-down. You should be able to see it using Safari or
Firefox on OS X. There is an outstanding Chrome issue related to this:
[https://code.google.com/p/chromium/issues/detail?id=62435](https://code.google.com/p/chromium/issues/detail?id=62435)

~~~
ezequiel-garzon
Available here [1] on page 18, or just look for "1F44E". I found the twist
really clever, and first thought the author had simply used a generic block.

I didn't spend much time reading the cited document, but it looks like some
existing glyphs may be merged with newer ones under certain conditions. What a
mess... I feel Unicode has tried to encompass way too many things (even
italics!). Go figure...

[1]
[http://www.unicode.org/L2/L2011/11149-webdings.pdf](http://www.unicode.org/L2/L2011/11149-webdings.pdf)

------
olssy
I think it's actually CESU-8 encoding:
[http://www.unicode.org/reports/tr26/](http://www.unicode.org/reports/tr26/)

I'm guessing it's implemented like this for performance reasons and calling it
utf-8 was just a marketing ploy as everyone knows we should always use utf-8
for everything...

------
micheljansen
Interestingly, Hacker News does not show the "thumbs down" emoji in the title.
No UTF-8 support either?

~~~
codeka
I think it was removed by the moderators, it was definitely there before.

------
edu
I had this same issue a couple of days ago. If you use Rails you can use my
migration: [https://gist.github.com/eduard-io/9268175](https://gist.github.com/eduard-io/9268175)

------
dudus
Is this also an issue with Google's Cloud SQL or Amazon RDS?

From what I understand these have the same interface as MySQL but internally
are different.

~~~
Xorlev
RDS is just hosted, managed MySQL (or Postgres, or Oracle). Don't know about
Cloud SQL.

------
blablabla123
Unicode support in <insert name of popular library or program that touches
characters> is...

------
DrJ
Every time I see a "MySQL is XYZ" article, I think about switching to MariaDB
or PostgreSQL.

------
schappim
Sounds like the OP should read "The Absolute Minimum Every Software Developer
Absolutely, Positively Must Know About Unicode and Character Sets (No
Excuses!)" over at:
[http://www.joelonsoftware.com/articles/Unicode.html](http://www.joelonsoftware.com/articles/Unicode.html)

~~~
oofabz
In this article, Joel says he "decided to do everything internally in UCS-2
(two byte) Unicode". He fell into the same trap that MySQL did and the
software he describes would also fail on Emoji.

It is still a very informative piece and Joel was way ahead of the curve by
supporting and evangelizing Unicode at all in 2003. But it is not the best
article to point the OP at, as it does not mention the BMP or discuss proper
handling of characters beyond the BMP.
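
A concrete example of why: the thumbs-down emoji is U+1F44E, which lies above
the BMP. UTF-16 represents it as a surrogate pair (0x1F44E - 0x10000 = 0xF44E;
high surrogate 0xD800 + (0xF44E >> 10) = 0xD83D; low surrogate 0xDC00 +
(0xF44E & 0x3FF) = 0xDC4E), whereas UCS-2 simply has no way to express it.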

~~~
jwarkentin
I never got that. He wrote a great article about the issue and then ended with
using UCS-2. I always wondered if there was something I missed that made him
choose UCS-2 over UTF-8, since UTF-8 can represent every Unicode code point.

~~~
rb12345
I imagine it's linked to Win32 and .NET using UCS-2 internally, since then
there's no need to convert strings before and after API calls.

