Hacker News new | past | comments | ask | show | jobs | submit login
Unicode support in MySQL is... (codeka.com.au)
157 points by codeka on Feb 28, 2014 | hide | past | web | favorite | 107 comments

Now presumably there is some drawback from doing this, otherwise "utf8mb4" would be the default (right?) but I'll be damned if I can figure out what the drawback is.

There are a couple, and they are insidious. From http://geoff.greer.fm/2012/08/12/character-encoding-bugs-are...:

This is when I discovered that InnoDB limits index columns to 767 bytes. Why is this suddenly an issue? Because changing the charset also changes the number of bytes needed to store a given string. With MySQL’s utf8 charset, each character could use up to 3 bytes. With utf8mb4, that goes up to 4 bytes. If you have an index on a 255 character column, that would be 765 bytes with utf8. Just under the limit. Switching to utf8mb4 increases that index column to 1020 bytes (4 * 255).

And later:

You might say, “OK, now we’re finished, right?” Ehh… it’s not so simple. MySQL’s utf8 charset is case-insensitive. The utf8mb4 charset is case-sensitive. The implications are vast. This change in constraints forces you to sanitize the data currently in your database, then make sure you don’t insert anything with the wrong casing. At work, there was much weeping and gnashing of teeth.

Good luck. The real solution is to switch to something that isn't MySQL. The project in that post ended up switching to Cassandra. For new SQL stuff, I now use PostgreSQL.


To those of you who point out that utf8mb4_unicode_ci is case-insensitive, we did try that. This whole fiasco happened two years ago, and I don't remember what the show-stopper was, but we decided not to use that collation. Whatever issues we had with it were bad enough to warrant switching to utf8mb4_bin and sanitizing everything. Really, the solution is to switch away from MySQL and forget about collations and encodings.

> For new SQL stuff, I now use PostgreSQL.

That's so true! Years ago I switched almost all projects to PostgreSQL. I never looked back. I use MySQL only for very legacy applications.

MySQL has so many bugs an inconsistencies, unicode support is only one of many issues. (Unsafe GROUP BY, silently cutting concatenated text fields when the result grows too big, silently converting invalid dates to '0000-00-00', and so on ...)

In the past, MySQL had some performance advantages over PostgreSQL, but only for MyISAM, i.e. without real transaction safety. Nowadays, if you want to trade ACID for speed, you wouldn't use MySQL but some "NoSQL" database instead. So even from a performance point of view, I see no point in using MySQL for anything but legacy stuff.

I switched to sqlite. I've done my time on SQL servers. In my view, they simply have too much complexity for many scenarios. Basically, one big database server equals one big downtime when issues occur. Sharded data architecture on the other hand can design for failure and maintain graceful degraded performance, not to mention optimized resource allocation and the option of adopting vastly different security and backup policies across different parts of the datastore (eg. encryption, snapshots, etc.). Yes, you lose database-internal consistency guarantees. No, that doesn't mean you have to lose consistency.

I did that in 2006. A big win was maintenance, since downloading/uploading a copy of the database was so simple. Eventually I had performance problems when doing read-write-read-write stuff, since writing locks the entire db file.

You could argue I shouldn't be doing that, and you may be right. But I just switched back to MySQL, and the problem went away.

SQLite fixed that issue in 2010 with write ahead logging (version 3.7). Writes go to a separate file so there is no impact on readers of the main file. It is a little more complex than that and described at https://www.sqlite.org/wal.html

very legacy?

much inconsistent.

One nifty feature of Postgresql is that if you have a group by a,b,c,d, you need to aggregate everything else in your select that isn't a/b/c/d, or the statement won't prepare, whereas if you don't select an aggregate in MySQL you'll just get one random column, which is probably not what you want.

This is mostly true, but in recent Postgres versions (since 9.1 I believe), if you GROUP BY the primary key, you can still include other columns in the SELECT statement. For instance:

    SELECT a.name, COUNT(b.id)
    FROM   a
    LEFT OUTER JOIN b ON b.a_id = a.id
    GROUP BY a.id
Pretty nice!

For anyone thinking of using utf8_general_ci, it's also seriously broken if you care about sameness of characters at all.

For example, utf8_general_ci treats u and Ü as the same character. They are very much not the same. Your sort orders for international texts will be all over the place if you use utf8_general_ci.

Use utf8_unicode_ci instead.

utf8_general_ci is a lot faster than utf8_unicode_ci and it's good enough for whenever you don't need to sort international strings.

Good enough when you ignore non English speakers is not good enough

For a lot of uses cases it very much is. Even ascii can work for some. Not all tables/DBs have to do with spoken language data.

But a lot of databases have to deal with people's names. To paraphrase an old saying, "I don't care why I'm in the database, as long as they spell my name right."

utf8_general_ci will still spell your name correctly, it just might not sort it correctly if you use an ORDER BY

For highly multilingual databases sort order is basically a lost cause anyway, because there is no globally correct, multilingual Unicode sort order. Even when using the same alphabet different languages have different sorting conventions. And some languages' correct sorting order is not even fully decidable solely from the UTF8 text. For example, Danish treats 'aa' as a sequence of two 'a' letters sometimes (mainly in words and names of non-Danish origin), and as a variant spelling of 'å' other times. So in a properly collated Danish encyclopedia, Aachen goes near the beginning of the encyclopedia, while Aalborg goes near the end. Good luck implementing that in your database!

The unicode people have addressed this with the Unicode Collation Algorithm - http://www.unicode.org/reports/tr10/ - which obviously can't be perfect, but it can be reasonable.

The ICU project - http://site.icu-project.org/ - has open source implementations of the collation algorithm including appropriate information for different locales. ie you shows those Danish names to a Danish user in their expected sort order, while also showing them to an American in their expected sort order.

i18n and l10n is hard. But it is also largely solved fairly well, and there is no excuse to avoid it all together or not use the ICU code.

In that case, why not use latin1? Or hey, why not just ASCII?

We're discussing how well each collation SORTS international strings, not whether or not you can store them. There are a lot of instances where you need to be able to store an international string without needing to be able to do a highly accurate ORDER BY on that column.

It's also worth mentioning that by design, UTF-8 doesn't use any more space for storage than ASCII. There are exceptions when databases need to pre-allocate storage, but in general, you should just be using UTF-8 everywhere.

The effect of various marks on whether it's the same letter or not depends on locale. You need to change the sorting based on context.

Don't hardcode utf8_unicode_ci either.

The case sensitive part is incorrect, the author made a mistake and corrects it in the comments, but not in the article (don't know why).

Case insensitive works fine with mb4.

Erm, what? You are the author! So why would you copy/paste a section you know is incorrect? And an even bigger question is why would you leave it on your website to confuse lots and lots of people!

Fix it both here and in the article!

Someone pointed out in the comments on that link that utf8mb4 isn't necessarily case sensitive. The author believed that due to the fact that he was using utf8mb4_bin_ci as collation, which is a binary collation. It works as expected when using a "true" collation like utf8mb4_general_ci.

The index problem is mildly annoying, but limiting index length isn't a big problem. For most applications, you can uniquely identify a row with much less than 191 characters, and it might be a good optimization to limit index length regardless to prevent the index size from growing out of control. If it's truly a problem, hacks like adding a sha1 or md5 hash column with a value derived from the original column and indexing that instead solves the problem.

Edit: Ops. Didn't realize parent was the article author. :)

There is a case insensitive collation for utf8mb4 so half of the problems there are just a misconfiguration. The index size restrictions are a problem though.

There is a utf8mb4_unicode_ci collation, but we had big problems with it. I'm sorry, but this happened two years ago and I can't remember what the exact issues were. Whatever they were, they were bad enough that we ended up using utf8mb4_bin and sanitizing everything.

Edit: Apologies, I wrote utf8mb4_general_ci when I meant to say utf8mb4_bin. I think the issue we had was conflicting rows on unique constraints due to utf8mb4_unicode_ci having different comparison rules than utf8_unicode_ci.

Don't want to be confused by utf8 vs utf8mb4 charsets or deal with the subtle differences between utf8_general_ci, utf8_unicode_ci, utf8mb4_general_ci, utf8mb4_unicode_ci, and utf8mb4_bin? Switch to PostgreSQL. You will not regret it.

That sounds very much like FUD. There is nothing inherently magical about utf8mb4_unicode_ci.

A vague problem like this is really not cool. At the minimum your readers deserve that you update the article.

Say that there is a case insensitive version. If you want add that you had some problem but that you don't know if it had anything to do with it or not.

The problem you had was probably because of the difference between utf8mb4_general_ci and utf8mb4_unicode_ci.

This is also the reason why MediaWiki have been using binary blobs for text data in their MySQL database. And have been for years.

Might not be any different, but I believe MediaWiki runs on MariaDB:


At that point they should really consider switching to another DB imo.

Perhaps. But I think they are keeping MySQL support both for legacy reasons and compatibility reasons. You could also argue they should stop implementing it in PHP, if they are willing to sacrifice its ease of install.

But MediaWiki does support other DBs, such as Postgresql and Sqlite.

Is there a phpmyadmin thing for PostgreSQL?


Cool, thanks for the pointer

Anther nice "feature" of mysql: If you have a table configured to UTF-8, and you use utf-8 client encoding but then accidentally try to send something that's not UTF-8 (shit happens), then mysql will happily accept your data and truncate it on the first byte it doesn't understand.

It issues a warning, but most frameworks don't care to inform the caller if a warning has happened (some don't even provide a way to access them).

Yes. You should not be sending invalid data to your database, but holy sh*t, your database shouldn't (mostly) silently alter the data you entrust it with.

If the input data is wrong and you can't deal with it, blow up in the face of the user. Don't try to "fix" it by corrupting it.

Disclaimer: This happened to me in 2008 (http://pilif.github.io/2008/02/failing-silently-is-bad/), so it might have been fixed since, but stuff like this made me lose my trust in MySQL long ago.

Except for causing annoyances, the truncation of data may also have security implications. Last year, I found that WordPress (which has the utf8 character set by default), was vulnerable to PHP object injection because their filter could be bypassed by (ab)using the truncation of strings in MySQL. For more details: http://vagosec.org/2013/09/wordpress-php-object-injection/

DBIx::Class (in perl) has a connection info option that calls a method that does its best to turn all of that horrible crap off:

    sub connect_call_set_strict_mode {
      my $self = shift;

      # the @@sql_mode puts back what was previously set on the session handle
      $self->_do_query(q|SET SQL_MODE = CONCAT('ANSI,TRADITIONAL,ONLY_FULL_GROUP_BY,', @@sql_mode)|);
      $self->_do_query(q|SET SQL_AUTO_IS_NULL = 0|);
Please feel free to steal it if it helps :)

> but most frameworks don't care to inform the caller if a warning has happened (some don't even provide a way to access them).

Then I would consider the framework broken. Warnings about truncated data are the norm in other RDBMSs.

I recently moved an application off mysql to postgresql after hitting this error. After over a decade of mysql doing things their way and it always being a pain in the ass for everyone else when is enough, enough?

I no longer develop software to run on mysql because I like my sanity.

Well, funnily enough MySQL are now owned by Oracle and I've wasted a couple of days trying to get some sane semblance of `LIMIT` working for me on Oracle... so don't hold your breath, Oracle have an even longer history of damning their customers...

I've been keeping a list of MySQL gotchas, but steering clear of the MySQL take on the SQL standard itself: http://williamedwardscoder.tumblr.com/post/25080396258/oh-my...

I still contemplate MySQL for new projects, because better the enemy you know and all that... OMG, I'm mad! Postgres rah rah rah!

Oracle does seems to make limit queries much harder than they need to be. I wrote an article about limit and paging queries in oracle ages ago that may interest you, but you probably have it all figured out now.


This actually made me think of: Oracle's "UTF8" vs "AL32UTF8" which I consider to be a gotcha, even though it is reasonable in terms of the evolution of the standard, a modern land mine perhaps.

The lack of limit sucks, but the thing that annoys me most regularly is that Oracle refuses to make the 0 length string / null behavior configurable to allow for ANSI compliance ( MySQL null handling is worse IMO, makes me insane.... )

Oracle historically has poor support. Here's an old example with an ADO bug, they never fixed it even though the support ticket was open for almost 2 years!


and unfortunately, they are the only ones who can offer support or implement a patch for their software!

With postgres and mysql you can find many companies competing to offer great service, and who can contribute their fixes and improvements directly should they so choose. The stronger the output of these communities and the organizations that support them, the more pressure for even the companies like Oracle to improve ( or perish :))

You should be able to do LIMIT in Oracle using http://www.oracle.com/technetwork/issue-archive/2006/06-sep/... . If you want a standards-compliant way to do it, you can use the ROW_NUMBER() window function. This might not be as fast, but will work on virtually every DB except MySQL...

Yes I'm deep beyond correctness and now want to make it fast...

Oracle is like COBOL; the only techies who like it are those whose job security depends upon it ;)

Yep, I switched to Postgresql after seeing a client lose data to corrupted databases - they had rebooted their server, and mysql refused to restart as a result and had to be manually recovered. Every time I hear another mysql horror story like this (truncation without warning was another favourite), it makes me thankful I no longer use it.

if it matters, must have backups! ( +replication +geographical distribution etc as needs must )

Even a perfect database, should such a thing ever exist, is not sufficient to protect us from a determined disaster

Sure, I came in after their disaster to clean up, and their issue was their backups weren't up to date and they had no replication etc (it was a small site, no sysadmin, they rebooted their server during a busy time and boom! All the other services came up ok, but not mysql). The fact that Mysql failed and ended up with corrupted dbs on a simple reboot (not power failure) doesn't fill me with confidence - that really shouldn't happen. That's just an anecdote of course, but I think I prefer working with psql anyway.

Interesting, pretty much every system has insane things in it. Is postgresql an exception?

Certainly not! However, its WTFs per minute count is far lower than MySQL.

I started off my career as a PHP/MySQL developer. PostgreSQL is definitely years ahead of MySQL. It used to be that MySQL had far superior read performance for simple queries, but that hasn't been true since around 2006. Also Postgres' SQL optimiser is really good.

  > PostgreSQL is definitely years ahead of MySQL
I guess this does not apply to replication, upserts, etc. Well, I think I will just grab some popcorn and wait. It will be fun to watch how those who never bothered to learn MySQL (and by this I don't only mean its peculiarities) will get bitten for not bothering to learn PostgreSQL.

Why? Because you think everything is equal, everything has its own gotchas and not one is better than the other?

That is against all history, not only in software. Postgres is obvectively better (safer, better engineered, easier to use, etc.) than MySql, just as Python is objectively better than PHP, and an AK47 was better than most other assault rifles of its time.

The lack of upserts is a problem in postgres, as much as some people say they can get away without them. There are objective concurrency issues with the lack of it, that have been acknowledged by postgres developers when they started work on upserts.

> There are objective concurrency issues with the lack of it, that have been acknowledged by postgres developers when they started work on upserts.

These concurrency issues don't fully go away by adding the feature, which is why it is taking time to work out the desired behavior there.

may be a problem, there are workarounds, and workloads where it is not required, but I agree, merge/upsert is a much anticipated enhancement.

I find the official documentation to be superior for postgres (YMMV), but I think there are certainly more examples / blogs / tutorials for MySQL.

I prefer to work with Postgres when I can, but I also have found instances where MySQL better fit my needs, a heap table, even with more recently added index only scans ( and even reordering the table with cluster may not be desirable) you may want an index organized table (oracle term ) (mysql would say clustered index )

I am curious why even in the context of freely available engines we focus on competition, which I think can lead to great things, but I don't understand the animosity the communities feel towards one another, isn't it just a matter of horses for courses? Are we really interested in competitive kills or improving both the tools we use and our understanding thereof ?

I don't mean to single your comment out gbog, and I know this is certainly not exclusive to databases either ( witness almost any discussion of various programming languages on HN )

The MySQL clustered index (and Oracle IOT) are, in my opinion, often over looked powerful features. To pick an example I recently encountered ... I have an Rails app to implement a RSS reader. Each user has many feeds that get added over time. If you cluster the feeds on user_id then all the feeds for a given user will be in a single database block or two, so reading them out is super efficient. If they are in a standard heap table, (or the way Rails creates the table by default, clustered on ID), then reading the list of N feeds could require N database block reads in the extreme case. Same goes for all articles on a given feed - cluster the related articles together and its very efficient to find the most recent articles for a feed.

in at least some cases, animosity comes from people who've have to clean up someone else's exploded mess in the tech of their targeted derision.

I do think many (most?) people are interested in seeing all tools get 'better' over time, but what needs to be improved, or how it should be improved, often don't get agreed on.

i don't think you know what objectively means.

In my experience, it has fewer "insane things in it" than MySQL.

PostgreSQL is no exception, but hey I do insane things with the db so I suppose it is fitting that I occasionally (but rarely) run into insane aspects of the system.

The difference is, you don't have to do crazy things to run into truly insane behavior on MySQL ;-)

Postgres may not be perfect, but it's certainly not as consistently dopey as MySQL.

I basically chose MySQL because of familiarity for the most part, but I think I'll definitely consider an alternative for my next project...

For me it was because AWS RDS didn't support Postgres until recently, although that has changed now.

> Why they call this encoding "UTF-8" is beyond me, it most definitely is not UTF-8.

This is a confusion of concepts. That an encoding can encode a particular value does not mean all applications using that encoding will support that value. You can store any size number you like into a JSON number, but if you send me a 256-bit value in a field for temperature in celsius, I'm going to give you back an error.

Encodings ease data exchange, they do not alter the requirements or limitations of the applications using them. MySQL's original unicode implementation supported only the Basic Multilingual Plane, so unsurprisingly, characters outside that plane are rejected. That would be the case regardless of the particular encoding used.

What you want is support for all Unicode characters. This is a reasonable request, but it is decoupled from the encoding used to communicate those characters to MySQL.

Say what? UTF8 allows for any character that the Unicode standard defines. If they wanted to restrict a characterset to the BMP, then they should call it UTF-8-BMP.

When the purpose of your software is simply to record data, saying you support a particular encoding when you only support _some_ of it is pretty irresponsible.

> That an encoding can encode a particular value does not mean all applications using that encoding will support that value.

And yet you have to change the encoding of a column to support storage of non-BMP characters, even though the UTF-8 encoding supports them just fine. Where's the sense in that?

No, you have to change the character set of a column. It's unfortunate they used a name normally used for an encoding to name a character set, but that doesn't make the character set an encoding.

It seems to me that you're the one confusing concepts. MySQL originally didn't support characters outside the BMP, fine. But since the encoding is completely different from that, a real UTF-8 would have automatically supported astral characters as soon as MySQL supported them.

The fact that MySQL itself gained support for all of unicode, but did not use this support in the 'utf8' encoding, shows that it is not actually UTF-8.

but I expect an app that says "you can store utf8 data here" to actually store that, not "I will silently drop everything after the first utf8 character that doesn't fit in 3 bytes".

I don't see how I should imagine that putting utf8 as a column encoding should be interpreted as "I will read and discard utf8".

This is different from declaring utf8 as the connection encoding.

Your expectation is common among westerners. It typically goes away within a few days of beginning to work with large amounts of CJK content.

When someone says they accept UTF-8, UCS-2, or any other encoding, my first question is always "Are characters outside the Basic Multilingual Plane supported?". There is a long history quite apart from MySQL of the answer to that question being "no", particularly if the project in question originated in the 90s (or even early 2000s).

That people still use MySQL for anything new is something I find completely amazing.

But then again, masochism is a reasonably common, if not normal, thing.

IMO this is a combination of "doesn't know any better", "works for me", and "it's easy enough, don't bother me with complicate solutions".

Which is unfortunate as the "complicate" solution is often just the right solution but if the wrong solution doesn't blow up into their faces in 99% to 90% of the use cases, many people are satisfied enough with it.

I think these reasons also play a role in PHP's ongoing popularity.

Last I checked the alternatives were either very expensive Oracle or did not have a good HA solution Postgres... MySQL/MariaDB has two good HA solutions - either behind haproxy with master-master or a VIP+keepalived with Galera cluster... The issue of encodings for most applications is minor...

You mean masochism like going for the "better" solution when you need it for your use case?

Concerning this issue please also consider Mathias Bynens's article from 2012 http://mathiasbynens.be/notes/mysql-utf8mb4

It has a good treatment of how to migrate encoding-wise which includes additional precautions not mentioned in the originally linked article.

Also, if you set a column charset to "Latin1", guess what encoding MySQL actually gives you. If you guessed "like windows cp1252, but with 8 random characters swapped around", have a cookie.

(Though I have to admit, based on the title I was hoping for a deeper problem. MySQL has plenty of stupid gotchas like this, but it's (marginally) easier to deal with them than to cluster postgres)

I also hit this issue when saving wikipedia pages to database, since some languages use these "4-byte UTF-8":s.

AFAICR, there is performance issues about this. Western text does not "need" this representation and thus the MySQL utf8 handling could be faster.

I was also shocked to learn this while using MySQL 5.3 (where utf8mb4 is missing), which led me to upgrade to the 5.5 alpha.

From https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8...

"The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters.

As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character supports supplemental characters"

PS. I have migrated to MariaDB. Take this chance to move away from Oracle MySQL!

Does anyone know if MariaDB fixes this issue by default?

It can't. Because of index sizes it would mean incompatibility.

But it's not really all that much of an issue - just use the mb4 character encoding.

The hard part is knowing you need to :)

It does not.

I can recommend reading the comment from Alex on there as well; it has some good info on UTF-8:

[quote] 😸 works just fine in PostgreSQL by default (Python too).

The UTF-8 issue is pretty complex, so it's not really a surprise that implementations (like MySQL's) are incomplete. The basic encoding from the code points into UTF-8 bytes could handle up to 30 bits of data (first byte of 0xFC indicating 5 following bytes with 6 bits of the code point each (6 bytes for 36 bit code points are only impossible because UTF-8 bans 0xFF and 0xFE as byte values other than for the Byte Order Mark). Yet various standards attempt to restrict the range, currently with RFC 3629 putting the ceiling at 0x10FFFF (4 bytes per code point). That RFC doesn't bother to justify the constraint, other than as a backwards compatibility band-aid with the more limited UTF-16. The point isn't to rag on UTF-16, which made sense once, but to express sympathy for the various attempts made to avoid coping with the full, 30-bit range the underlying encoding can actually handle.

Not only is there a wash of hodgepodgery on the range, but UTF-8 (and other similar encodings) can put small values in large byte encodings. There's nothing stopping someone from just using a 6-byte long, 30-bit codepoint block of RAM for each character, even just ASCII (obviously this violates a bunch of RFCs, but the coding system does provide for it). The result would confuse many UTF-8 parsers (and blocked by the more complete ones, a Pyrrhic victory), since ASCII characters are expected to be exactly one byte long in UTF-8, not seven. Even in RFC-compliant UTF-8, an ASCII character codepoint value can be encoded in four different ways. Example: an "a" (ASCII 97, hex value 0x61) can be encoded as any of (pardon any bit errors, I'm doing this off-the-cuff, late at night...):

00111101 - 0x61 - conventional ASCII 11000000 10111101 - 0xCO 0xBD - two-byte UTF-8 11100000 10000000 10111101 - 0xE0 0x80 0xBD - three byte UTF-8 11110000 10000000 10000000 10111101 - 0xF0 0x80 0x80 0xBD - four byte UTF-8

The latter three overlong encodings aren't considered canonical, and software devs have to fix them to make string matching efficient and so forth, to save space, and most worrisomely to prevent attackers from sliding special characters into strings to crack systems - say by using larger, noncanonical encodings to evade filters that would catch and block the canonical, shorter encodings.

Some developers would try to limit UTF-8 characters to four bytes because that's a power of two, and fits comfortably into a 32-bit long int.

UTF-16 also complicates UTF-8 with the banning of codepoints between 0xd000 and 0xdfff in UTF-8, used as "surrogate pairs" in UTF-16. See http://www.unicode.org/version... for an update that mentions this explicitly)

Anyway, the general idea is that I can sympathize with MySQL having incomplete UTF-8 support, most implementations are incomplete in some way, and one could argue the RFC's variant is pointlessly incomplete itself (no 24 and 30 bit code points, though apparently UNICODE's Annex D does allow the use of those two larger sizes for characters outside of the UNICODE range, perhaps widely interpreted as a constraint of UTF-8 itself instead of being about the UNICODE subrange of UTF-8). Implementations vary enormously, and with good reason (see http://www.unicode.org/L2/L200... for some of them).

Fortunately, MySQL's limitation doesn't trouble me, since I do almost all my work in PostgreSQL or some non-SQL database or another anyway. ;-P [/quote]

I don't know; to me it reads like someone trying to show off, but I don't think it's helpful.

> The latter three overlong encodings aren't considered canonical, and software devs have to fix them to make string matching efficient and so forth, to save space, and most worrisomely to prevent attackers from sliding special characters into strings to crack systems - say by using larger, noncanonical encodings to evade filters that would catch and block the canonical, shorter encodings.

This is misleading. "Software devs" don't have to "fix them". Overlong encodings in UTF-8 are invalid since Unicode 3.1, precisely because of the security considerations, and conformant implementations are required "not to interpret any ill-formed code unit subsequences" (Unicode Standard section 3.9). RFC 3629 also states that "[i]mplementations of the decoding algorithm above MUST protect against decoding invalid sequences."

Fortunately, it's pretty straightforward to achieve this, and the Unicode standard contains a nine-row table listing all valid combinations of ranges of one to four bytes.

It'd be really interesting to hear from an engineer at one of the big MySQL users (I'm thinking Twitter, Facebook, Github) how they solve this problem.

This bit me pretty hard. I was trying to debug other Unicode issues an application had, and in my infinite and enduring wisdom decided to use one of those "turn text upside down" sites to generate sample data. Turns out they use a bunch of very high codepoints (Mormon text or something) and MySQL choked on them. Took me ages to figure out that was the issue.

UTF-8 is handled correctly by Firebird and Flamerobin You can create a new db with utf-8 charset as default and then you can insert/select

Unicode codepoint U+1F638 a perfectly valid Emoji character


Does anyone else just see squares instead of whatever unicode symbol it is supposed to be? I'm using Chrome 33.0.1750.117 m

It's an emoji thumbs-down. You should be able to see it using Safari or Firefox on OS X. There is an outstanding Chrome issue related to this: https://code.google.com/p/chromium/issues/detail?id=62435

Available here [1] on page 18, or just look for "1F44E". I found the twist really clever, and first thought the author had simply used a generic block.

I didn't spend much time reading the cited document, but it looks like some existing glyphs may be merged with newer ones under certain conditions. What a mess... I feel Unicode has tried to compass way too many things (even italics!). Go figure...

[1] http://www.unicode.org/L2/L2011/11149-webdings.pdf

On Windows and with Chrome, yes.

On iOS 7.0.1 it's a colorful thumbs down.

Internet Explorer 11 shows a black/white thingie which might be a thumbs down with a lot of imagination. Firefox shows the same bitmap.

You need a font which supports emoji. Android, iOS and Mac OS X should support it out of the box, on Linux I had to install a font. I'm not sure what the situation is on Windows.

It's a Unicode thumbs down sign. You must not have any fonts with that glyph. It works fine for me in both Firefox 27 and IE 11 on Windows 8.1

I'm pretty sure that's intented

I think it's actually CESU-8 encoding: http://www.unicode.org/reports/tr26/

I'm guessing it's implemented like this for performance reasons and calling it utf-8 was just a marketing ploy as everyone knows we should always use utf-8 for everything...

Interestingly, Hacker News does not show the "thumbs down" emoji in the title. No UTF-8 support either?

I think it was removed by the moderators, it was definitely there before.

I'd this same issue a couple of days ago. If you use Rails you can use my migration: https://gist.github.com/eduard-io/9268175

Is this also an issue with Google's Cloud SQL or Amazon RDS?

From what I understand these have the same interface as MySQL but internally are different.

RDS is just hosted, managed MySQL (or Postgres, or Oracle). Don't know about Cloud SQL.

Unicode support in <insert name of popular library or program that touches characters> is...

every time I see a MySQL is XYZ article I think about switching to MariaDB or PostgreSQL.

Sounds like the OP should read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" over at: http://www.joelonsoftware.com/articles/Unicode.html

In this article, Joel says he "decided to do everything internally in UCS-2 (two byte) Unicode". He fell into the same trap that MySQL did and the software he describes would also fail on Emoji.

It is still a very informative piece and Joel was way ahead of the curve by supporting and evangelizing Unicode at all in 2003. But it is not the best article to point the OP at, as it does not mention the BMP or discuss proper handling of characters beyond the BMP.

I never got that. He wrote a great article about the issue and then ended with using UCS-2. I always wondered if there was something I missed that made him choose UCS-2 over UTF-8, since UTF-8 can represent every Unicode code point.

I imagine it's linked to Win32 and .Net using UCS-2 internally, since then there's no need to convert strings before and after API calls.

Actually, Joel's article is also outdated, e.g. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.


What is the maximum number of bytes for a UTF-8 encoded character?


Why? UTF8 covers more than just the BMP, it's just an encoding scheme. This has nothing to do with the fact that UTF8 is restricting characters to 0xFFFF!

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact