
In MySQL, don’t use “utf8”, use “utf8mb4” (2016) - mariuz
https://medium.com/@adamhooper/in-mysql-never-use-utf8-use-utf8mb4-11761243e434
======
chrismorgan
Possibly my favourite thing about the rise of emoji is that they’re not Basic
Multilingual Plane, and so stupid hacks like MySQL’s utf8 character set (I
seriously don’t know why anyone ever thought it was a good idea in 2002–2004)
are now _obviously_ insufficient for full Unicode support, and enough people
_care_ about them that adoption of this basic level of Unicode is driven, and
so non-English speakers are inadvertently helped.

The next hurdle is to convince people that extended grapheme clusters are the
smallest divisible unit, and that just like slicing at the byte level is
insufficient if you’re doing UTF-8, slicing at the code point level is
insufficient. Basically, strings should just be treated as opaque data in
almost all places.

~~~
chrismorgan
While I’m contemplating why people might have thought utf8mb3 was wise in
2004, I’ll ask a similar question that I’d love to know the answer to, of
events from my early childhood: why did UTF-16 gain popularity? I get UCS-2
existing, but once it became evident it wasn’t enough, why make UTF-16 and
ruin Unicode with surrogate pairs (which still cause trouble today, leading to
things like WTF-8 to make it possible for web browsers to abandon their
UTF-16/UCS-2 mess in favour of almost-UTF-8)? UTF-8 was around by early 1993,
while UTF-16 was developed several years later and seems fairly obviously the
worst of both worlds—twice as large for normal text, ASCII-incompatible, and
not even a fixed-width encoding, the _only thing_ it almost had over UTF-8.
The only thing it had going for it was that it was _roughly_ compatible with
UCS-2; but was UCS-2 _that_ popular at the time?

(Concerning UCS-2, hindsight says it should have been obvious 65,536 code
points was going to be insufficient, but I’ll overlook that as an easy mistake
to make; that part is better documented than the rise of UTF-16.)

Possibly the most helpful description I have is from
[https://en.wikipedia.org/wiki/UTF-16#History](https://en.wikipedia.org/wiki/UTF-16#History),
but even that doesn’t explain why it became _popular_. Why did Java adopt it,
Windows, browsers, _& c._? _Why?_

~~~
bradleyjg
With utf-8 you give up any hope of O(1) string operations. With utf-16 you can
get O(1) string operations at a cost of doing them on code points rather than
graphemes. If greater than 95% of the strings your language/program will ever
see are going to be a single code point that trade off seems worth it.

I think a lot of people totally discount the enormous cost that Latin-1 users
are paying for CJK (etc) support they don’t use. Maybe it’s the right call,
but it isn’t _obviously_ so.

Personally, at least for internal use, I like the idea of specialized
implementations. At (immutable) string creation time decide whether to
construct a Latin-1 string or a UTF-8 string (and possibly other fixed width
encodings) depending on what code points are present. Expose the same set of
operations (i.e. slicing on grapheme clusters) but with better performance
characteristics where possible.

For storage at rest and especially communication with external programs I
agree you probably should emit UTF-8.

~~~
pwagland
Except that with utf-16 you still have to give up any hope of O(1) string
operations, unless you don't care about accuracy. Only UTF-32 allows O(1)
index based operations.

For example, 🧐 (aka face with monocle) needs 2 UTF-16 'characters', since the
UTF code point is U+1F9D0, or, in UTF-8: F0 9F A7 90.

And this ignored the point that others have made about combining characters,
such as ‍ woman farmer Unicode: U+1F469 U+1F3FD U+200D U+1F33E, UTF-8: F0 9F
91 A9 F0 9F 8F BD E2 80 8D F0 9F 8C BE

That beast needs 7 UTF-16 or 15 UTF-8 "characters". I.e. 14 bytes vs 15 bytes.
And this is still only one displayed character.

~~~
kyle-rb
To clarify the woman farmer being 14 or 15 bytes: the example given is four
different combined characters, and is additionally a "medium skin tone woman
farmer".

A woman farmer requires the "woman" and "ear of rice" emoji, with a zero-width
joiner character between them. To change the skin tone from the default
yellow, a "medium skin tone" (type 4) modifier is added after the woman, but
before the joiner.

So the sequence "U+1F469 U+1F3FD U+200D U+1F33E" represents "woman skin-tone-4
joiner ear-of-rice". And in the UTF-8 bytes, the first four bytes are "woman",
the next four are "skin-tone-4", then three for "zero-width-joiner", and
finally four for "ear-of-rice".

Emojipedia helpfully lists codepoints for each multi-character emoji:
[https://emojipedia.org/female-farmer/](https://emojipedia.org/female-farmer/)

------
morgo
Hi! Former Product Manager for MySQL Server here. I have a blog post on this
here: [https://mysqlserverteam.com/mysql-8-0-when-to-use-
utf8mb3-ov...](https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-over-
utf8mb4/)

The default for MySQL 8.0 (current GA) is utf8mb4. I also recommend reading
this introduction post here: [http://mysqlserverteam.com/sushi-beer-an-
introduction-of-utf...](http://mysqlserverteam.com/sushi-beer-an-introduction-
of-utf8-support-in-mysql-8-0/)

~~~
JdeBP
> _An optimization was chosen to limit utf8 to 3 bytes_

You chose not to define the UNICODE_32BIT macro that was in the original
source from 2002?

* [https://github.com/mysql/mysql-server/commit/55e0a9cb01af4b0...](https://github.com/mysql/mysql-server/commit/55e0a9cb01af4b01bc4e4395de9e4dd2a1b0cf23#diff-95de3baa9e399c689954d1b8442d0be8R1611)

~~~
morgo
The optimization was made at a different time in history. As I wrote in
another comment here: while the decision is regrettable, because of a desire
for backwards compatibility and simplified upgrade there wasn't an easy way
out of it.

The utf8[mb4] support in MySQL 8.0 is much better than prior releases. The
team worked really hard on eliminating the use cases for utf8mb3 (performance)
so that "just use the default of utf8mb4" can work for everyone.

~~~
jeltz
Is it really an optimization? I always assumed they picked 3 bytes to make
sure a varchar(255) still would fit within the InnoDB field size limit of 768
bytes, so they could avoid/postpone having to fix InnoDB to support variable
size fields.

~~~
morgo
This decision was made before InnoDB was the default.

While the performance differences are historical now, consider the case that
because of fixed sized buffers, one query could switch from a temp table in
memory to a temp table on disk. The profile of the query has effectively
changed.

~~~
jeltz
Yes, but did it happen before InnoDB was widely used?

------
Dylan16807
Extremely important: If you use mysqldump, _make sure it 's also using
utf8mb4_. There is a very high chance it defaults to utf8 and silently
corrupts your data.

~~~
crazygringo
Oh my god, this is both terrifying and could one day prove to be the most
valuable comment I've seen on HN.

I love MySQL, but the fact that they didn't make utf8mb4 the default for
mysqldump (instead of utf8), when they created utf8mb4 for database use... is
one of the most irresponsible and negligent pieces of engineering I've seen in
a long time, considering the amount of data worldwide backed up with
mysqldump.

The fact that a program designed solely to back up databases would default to
silent corruption like this is mind-boggling. At least they've fixed this in
the future for MySQL 8.0.

~~~
WrtCdEvrydy
Imagine how many companies out there are relying on backups that are already
corrupted...

is your company one of them?

~~~
ifoundthetao
This is an important reason why testing your backups is a critical part of
backing up.

~~~
mmt
This is oft-repeated advice, but, in this context, it ends up being little
more than a platitude.

Since the complaint here is of corruption that is _silent_ , the level of
testing required to catch it would be extraordinary.

------
acabal
This bit me hard many years ago in possibly the most frustrating bug I've
encountered--frustrating because of the _stupidity_ of MySQL's implementation.
I had a utf8 MySQL database that was created back before utf8mb4 existed--this
is, back in the days when emojis were _just_ getting popular, but not quite
popular enough to be on software keyboards; and when only a few people had
smartphones and iPads, but not everyone.

People would post long-form content to my site that took a long time for them
to compose, and maybe once a month, someone would email me to complain that
their long post was mysteriously cut short. Inspecting the database showed
that the content was being stored there cut off, but with no rhyme or reason.
Nothing in the error logs. 95% of other posts went through fine regardless of
length or complexity. I put in tons of debugging code at various levels of the
app to see where the data was being mis-parsed or truncated. Nothing!

The bug persisted for _years_ without a hint. Finally, emojis became more
popular, the bug started occurring more and more often, and MySQL released
utf8mb4; and then it dawned on me: Occasionally, people would submit content
that contained a Unicode emoji generated by their software keyboard, and MySQL
would _silently truncate the emoji and all the data after it WITHOUT WARNING_
when storing it in a utf8 column!! Silently!

I was so mad. A bug that plagued users for years, so much time and effort
spent trying to debug it, and it was because of a combination of MySQL's dumb
engineering decision to not support full Unicode for something called "utf8"
and then _silently truncating data without any warning_.

I don't know what the lesson is here. But I do know it made me really, really
mad at MySQL when I found out what the problem was.

~~~
mnm1
I think the lesson is "don't silently fail," and in that regard mysql has
failed at learning that lesson over and over again to the point where the
lesson is now, "don't use mysql" because what else can you do when dealing
with such shitty software? This is far, far from the only stupid bug like
this. Use mysql only if you don't care about preserving data. Period. There
are many valid use cases that don't care about preserving data 100%,
apparently most companies, so it's not a huge issue in reality. Hopefully the
data that deals with monetary transactions won't get corrupted, but when it
does, one cannot complain if they used mysql. It's that simple.

~~~
roboyoshi
does that affect MariaDB as well? That would only leave people to PostgreSQL
or NoSQL Databases like MongoDB AFAIK. I've never been a fan of MySQL, but
MariaDB always seemed to me like the spark of hope that was left in there.

~~~
mnm1
Yes because it's built on the same codebase with the same design decisions and
bugs. Maybe in the future it will diverge enough and fix such issues although
compatibility with mysql might be affected.

------
Quarrelsome
Friend of mine just recounted a story of when this bug killed all their MySql
instances across an entire enterprise installation because a user sent an
email with the hamster face emoji in the title.

~~~
bri3d
I worked at large company where an emoji or other 4-byte UTF character in a
commit message would take down multiple systems simultaneously, including JIRA
and various internal auditing and pipeline monitoring tools. It happened
surprisingly rarely considering the lack of safeguard against the issue.

~~~
joombaga
This happened to us too. We first noticed it in Bamboo with a failed build
job. It's been a long time since I installed Bamboo, but if I recall correctly
the installer adds its own schemata.

------
skywhopper
This is a fascinating tale. But the #1 takeaway by the author is dangerous:
"Database systems have subtle bugs and oddities, and you can avoid a lot of
bugs by avoiding database systems."

This is correct as far as it goes, but it's poor advice. You may as well say
"Software has subtle bugs and oddities, and you can avoid a lot of bugs by
avoiding software." True! But not helpful! If you don't need a relational
database, then yes, don't use one. But there's nothing specific about
databases as a category that deserves a special callout. You can also avoid a
lot of bugs by avoiding TLS, containers, threads, coroutines, blocking calls,
asynchronous calls, networks, filesystems, operating systems, and computers.

~~~
drb91
Dangerous? How do you figure?

Many people assume a DB is the only way to store data. Now that’s dangerous.

~~~
sattoshi
Because filesystems don't contain oddities...

~~~
drb91
All storage systems have trade offs. What else is new?

------
ofrzeta
I've "bookmarked" this: [https://mathiasbynens.be/notes/mysql-
utf8mb4](https://mathiasbynens.be/notes/mysql-utf8mb4)

It digs a little bit deeper on the MySQL side (and it's from 2012 btw)

------
chx
> If you need a database, don’t use MySQL or MariaDB. Use PostgreSQL.

This is stupid. There are pros and cons to every database. For example, MySQL
allows you to set a trigger on an op, PostgreSQL requires you to write a
function first.

~~~
neverminder
> This is stupid.

Talk about stupid, run this query on both Mysql and Postgres:

select 0 = 'banana';

Make your own conclusion.

~~~
gmac

      mysql> select 0 = 'banana';
      +--------------+
      | 0 = 'banana' |
      +--------------+
      |            1 |
      +--------------+
      1 row in set, 1 warning (0.00 sec)
    

vs

    
    
      pg# select 0 = 'banana';
      ERROR:  invalid input syntax for integer: "banana"
      LINE 1: select 0 = 'banana';

~~~
mmt
My conclusion (not just based on this example), is that MySQL favors (initial)
ease [1], whereas PostgreSQL favors correctness [2].

To suggest the former approach is "stupid" is to fail to explain how it became
so wildly popular. I allege that it is actually brilliant.. marketing.

MySQL marketed itself to developers by being not just free, but easy to work
with. Need a database? Want to learn SQL? We won't bore you or slow you down
with Codd and those pesky rules maaaan. It'll just work!

For the vast majority of use cases, it did exactly that.

Meanwhile, Postgres was esssentially sending the stodgy, old-fashioned message
that databases are a hard problem that require careful up-front consideration
so that one doesn't run into problems down the road, be they scale/performance
or more severe like (even non-catastrophic) data loss.

That this message was, and still is, correct, just isn't very compelling in a
rapid-protoyping fail-fast world. Unfortunately for those of us whose job it
is to deal with the eventual consequences, it's too late by the time we're
brought in [3].

I think we've seen a similar effect with many of the "NoSQL" datastores, as
well. They gain initial popularity due to being easy, lightweight, and
lightnight fast, but, as they mature and gain real-world use, we see articles
that are shocked (shocked!) that there is data loss going on in this
establishment and that, yes, if you need certain tedious features (like data
integrity) from that stodgy old database world, you'll have to sacrifice that
lightning fast performance.

[1] aka flexibility [2] aka standards adherence or even pedantry [3] though
I've never quite understood the reasoning of why it would be so impractical to
move RDBMSes when there's so often an ORM and no MySQL-specific features in
use

------
Strilanc
Wait, isn't 4 bytes per character _also_ not enough to store UTF8? Can't each
character be up to 8 bytes long, because the number of leading 1s in the first
byte determines the length of the character?

~~~
ainar-g
According to the Wiki, the original UTF-8 proposal used up to 6 bytes.[1] But
the Unicode standards people decided that 21 bit should be enough for
everyone:

>Since the restriction of the Unicode code-space to 21-bit values in 2003,
UTF-8 is defined to encode code points in one to four bytes, depending on the
number of significant bits in the numerical value of the code point.

So, if they ever decide to use those extra two bytes, it will probably not be
called UTF-8. Perhaps something like "UTF-8-MB-6".

[1]
[https://en.wikipedia.org/wiki/UTF-8#History](https://en.wikipedia.org/wiki/UTF-8#History)

------
MrBuddyCasino
"An optimization was chosen to limit utf8 to 3 bytes, enough to handle almost
all modern languages."

I can't for the life of me imagine how, when discussing this in a room full of
competent developers, everyone just went "great idea, lets do this". Just adds
to the long list of questionable MySQL engineering decisions.

~~~
jeltz
I am pretty sure it was not because of any optimization but that the real
reason was to make sure a VARCHAR(255) would still fit within the maximum
field size of InnoDB (768 bytes) at the time.

Since then MySQL has added ROW_FORMAT=DYNAMIC to InnoDB which allows for
larger values, but that probably required major changes to InnoDB.

~~~
MrBuddyCasino
There is usually a good reason for doing these kind of things. But then _don
't call it UTF8 encoding_! It doesn't do that. It stores the most popular
characters out of a larger address space. Call it UTF8-MB3 maybe from the
start and let the user decide if having varchar(255) is worth the tradeoff.
Thats the trouble with MySQL: doing silently the wrong thing, and thus
violating the principle of least surprise. There is software where this might
be ok, but a database is not that kind of software.

~~~
tinus_hn
In the case of MySQL (especially MySQL in 2003) bad design and silly behavior
isn’t much of a surprise though :)

------
ofrzeta
Side note: It's not enough to change the encoding of the database, you need to
make sure that all involved parties use the same encoding (or convert the
charset at the interfaces).

For instance in a Laravel app I had to set the charset in the database
configuration file "app/config/database.php".

------
willvarfar
> I’ll make a sweeping statement here: all MySQL and MariaDB users who are
> currently using “utf8” should actually use “utf8mb4”. Nobody should ever use
> “utf8”.

I completely agree with this sentiment, although I recall being bitten by it
in indexes. The article goes on to talk about how big a CHAR(1) is if its utf8
vs utf8mb4. Well, the same applies to varchars and so a varchar(64) is
potentially requiring 64, 192 or 384 bytes, depending on the charset. This can
bite you because the MySQL query planner limits the longest prefix used for
index scans to iirc 767 or so bytes. So I migrated some columns from utf8 to
utf8mb4 and suddenly my compound indices stopped being used because the
varchar contents could legally exceed that maximum. Something like that.

------
chrismorgan
One troublesome thing about utf8mb4: if you have a VARCHAR( _n_ ) column, this
means 4​ _n_ bytes are required; this doesn’t mix particularly well with
indexes, as you can only index columns of up to 767 bytes; VARCHAR(192) is
thus 768 bytes, and too long to be indexed—so you’re limited to VARCHAR(191)
or less.

This leads to things like django-celery flat-out not working on utf8mb4
without a (history-rewriting) patch, because it had a VARCHAR(255) primary
key: [https://github.com/celery/django-
celery/issues/259](https://github.com/celery/django-celery/issues/259).

~~~
pilif
then again, indexes on VARCHAR() can only really be used for equality and
start matches.

Now I'm not saying it's impossible, but I have a feeling you're not likely
going to do a lot of `where string_column =… ` or `where string_column like
'…%'` where … is 192 characters in length.

If you need to do substring search or full text search, a traditional index
won't help you anyways.

~~~
chrismorgan
Yeah, string primary keys are the only place I expect people to actually hit
this, as seen in the django-celery case.

------
kreetx
Given that the article says that everyone should use utf8mb4, then why
don't/haven't they just fixed it? Should be easy enough, and probably safe,
too.

~~~
ofrzeta
They have. It's the default in MySQL 8.

~~~
kreetx
Sorry, should have bern more clear: why didn't they fix the bug in utf8 so
that it is correct (and would do what utf8mb4 does now)? It's just that the
obvious choice of 'utf8' has been plain wrong for a long time.

~~~
shadofx
from the article:

>Once MySQL published this invalid character set, it could never fix it: that
would force every user to rebuild every database.

------
spectrox
It is worth mentioning that if you're stuck with InnoDB + COMPACT + utf8mb4
then you should be using VARCHAR(191) instead of VARCHAR(255) for unique
columns.

[https://dev.mysql.com/doc/refman/5.7/en/innodb-row-format-
an...](https://dev.mysql.com/doc/refman/5.7/en/innodb-row-format-
antelope.html)

------
inyourtenement
Currently leading a massive thorny project to convert all of our databases to
utf8mb4. Definitely heed this warning.

------
cdevnull
I use utf8 for "normal" field like name, address, etc.. (don't want smiley as
first name). And use utf8mb4 for comment for example.

But I need filter/validate each inputs (and refuse when there is 4 bytes
characters).

~~~
khaled
There are more characters outside BMP than just emoji.

------
codedokode
In case you want to convert your MySQL database tables from utf8 to utf8mb4
please remember that each character will now take 4, not 3 bytes. This means
you can easily run out of limit for a row length (65536 bytes) if you have
many string columns and your indexes will grow. Also, if a column is a part of
a foreign key reference you might have to remove the foreign key first,
upgrade columns on both sides and add the foreign key back.

So converting an existing database might be not as easy as it seems before.

------
danso
The article mentions it, but one of the massive frustrations with this issue
is the huge amount of misinformation surrounding it (at least back in 2013),
with many well-intended users on StackOverflow and the MySQL forums giving the
exactly wrong advice on collation. Naming things, and not getting confused by
bad names, is damned hard. This issue led me to quit MySQL for good, I can’t
think of a more trivial issue that caused me such confusion and frustration as
a software engineer.

------
cyberferret
Good reminder. I did read a well hidden article many years ago which highly
recommended this. I believe I was hitting an obscure, arcane bug like the OP
mentioned in the article, and after much digging managed to find this fix.

Since then, I _always_ set every new MySQL database (and even double check the
tables) to be 'utf8mb4' even long after the original lessons and pain have
been forgotten.

------
JackCh
> _" Once MySQL published this invalid character set, it could never fix it:
> that would force every user to rebuild every database. MySQL finally
> released UTF-8 support in 2010, with a different name: “utf8mb4”."_

Had MySQL defined UTF-8 as up to 6 bytes initially, would they later force
users to rebuilt their databases when UTF-8 was redefined to only 4 bytes?

------
aiCeivi9
> X have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding
> X.

That doesn't sound like a useful advice

------
billpg
Dear database developers, please don't get hung up on string lengths.

I'd be very happy to ask the DB for (say) a UTF-8 string of up-to 80 bytes.
That means I could store 80 ASCII characters, 40 two-byte characters, 20 four-
byte characters or some mixture as I wish.

~~~
jeltz
At least PostgreSQL does not care about string length, other than as required
by the SQL standard (e.g. it enforces varchar(20) by counting the number of
code points). The maximum size of a string in PostgreSQL is 1 GB.

------
j_m_b
It's a code smell when what is obviously and intuitively stated, is in fact,
incorrect.

------
golergka
Oh, I love these kind of Unicode bugs. Also, please don't use utf16 in C# -
although it advertises as using utf16, all of it's built-in string processing
actually works on code units, not Unicode code points and breaks down on
surrogate pairs.

Anyway, each time you start a new project that is supposed to work correctly
with multi-language data, please set up evil strings test suite as early as
you can and do not assume that any of the 3rd party software that you're using
won't break down.

~~~
machinecoffee
Previous discussion on HN for 'Naughty Strings', was certainly an entertaining
and eye opening read...

[https://news.ycombinator.com/item?id=10035008](https://news.ycombinator.com/item?id=10035008)

------
tambourine_man
Anyone more knowledgeable than me care to explain the difference between
utf8mb4_unicode_520_ci and utf8mb4_unicode_ci?

~~~
detaro
Different versions of the Unicode collation algorithm, which determines how
strings are sorted. 520 is Version 5.2.0, without the number it's 4.0.0 by
default

[https://dev.mysql.com/doc/refman/5.6/en/charset-collation-
na...](https://dev.mysql.com/doc/refman/5.6/en/charset-collation-names.html)

~~~
tambourine_man
Thanks, but what are the implications, if any, of using either one?

------
codetrotter
Instead of MySQL, use PostgreSQL.

------
Bromskloss
> In MySQL, never use “utf8”. Use “utf8mb4”.

I thought you shouldn't use MySQL in the first place.

------
muglug
Counterpoint: emojis are bad and should be discouraged

~~~
morgo
I feel just the opposite: emojis are a lot more expressive than text and more
universal in language than en-us.

~~~
nolliebs180
What about winking emojis? I feel there's a lot more ambiguity/nuance with
them, which can be a good thing, but can also express something you may or may
not intend to.

