
Getting Out of MySQL Character Set Hell (2009) - scintill76
https://www.bluebox.net/insight/blog-article/getting-out-of-mysql-character-set-hell
======
riffraff
ctrl+f doesn't show mention of utf8mb4 except in comments.

In short: you should only use utf8mb4 in mysql as charset, cause utf8 in mysql
is a broken invented 3-byte charset that will explode when you input 4-byte
characters (i.e. an emoji).

utf8mb4 is what the rest of the world actually calls utf8.

~~~
morgo
To clarify: the 3-byte subset is not a MySQL invention. It is the Basic
Multilingual Plane (BMP):
[http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multi...](http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane)

~~~
sly010
Most smileys now ubiquitous on smart phone keyboards are valid Unicode
characters outside of the BMP set, yet being used constantly on the web. BMP
just doesn't make sense anymore, nor does mysql utf8.

~~~
morgo
I totally agree! It also makes sense for countries that have their own
character sets to use utf8 too (as they may borrow words, characters from
other languages).

I was just pointing out that even if it was limited support, it wasn't a set
of characters that MySQL came up with.

------
donatj
The authors actual understanding of MySQLs encoding is only half way there.
Collation specifically is the algorithm it will use when ordering by the
column for instance.

The article contains numerous half truths and generalizations (the kind
someone in the process of learning would say) and not expertise.

Take what he says with a grain of salt.

~~~
TazeTSchnitzel
Collation is _both_. It combines both the locale-specific sorting approach,
and the encoding.

~~~
donatj
Collations are encoding specific, eg: utf_general_ci works specifically on
UTF8 values, that said, the collation does not determine the encoding, the
encoding determines the potential collations.

~~~
TazeTSchnitzel
But the collation field specifies both the collation and the encoding?

~~~
donatj
CREATE TABLE tbl_name (column_list) [[DEFAULT] CHARACTER SET charset_name]
[COLLATE collation_name]]

Character set is specified separately from collation. It's like having a ford
vs a ford headlight. The collation is a part of the character set.

------
tempodox
“ _Sometimes forcing the browser to treat the page as being encoded in UTF-8
will show you the characters as they 're supposed to be._”

You really _SHOULD_ always say something equivalent to charset=utf-8 in your
HTML head (naming your actual encoding, of course).

~~~
stephen_g
Naming your encoding? So always charset=utf-8 then, right?

Seriously - is there ever any reason to use anything else on the Internet?

------
jcampbell1
I use utf8mb4 with binary collation as a default.

When I then get to a column type like email, I don't know what I should use.
Dropping down to 7-bit case-insensitive sounds plausible, but another option
would be to stick with utf8mb4 but case-insensitive?

Is there a best practice?

~~~
crazygringo
I almost exclusively use utf8mb4_bin (for "human" text) and ascii_bin (for
ID's or other "computer" text) -- they're basically the two 'extremes', and
it's rare to need anything in between.

And then other collations like case-insensitive or accent-insensitive only
when required, e.g. because the field is used for search.

------
utefan001
Does MySQL have a dump(column_name) function like Oracle? The dump function
was a life saver when I needed to verify if some non-ascii columns had
transferred correctly or were now full of upside down question marks.

~~~
morgo
Easiest way is probably to run a HEX() function on the column, and compare
output.

------
davidgerard
WTF-8

