WordPress Latin1 and UTF-8, Part 2

bellyfullofbac · on Oct 17, 2021

The 00's called, they want their encoding problem back. (Not being dismissive, I'm just amazed that the default setup can still break nowadays). I remember dealing with this stuff back in 2004, a page that helped me understand why ö became Ã¶ was Jeppe's Unicode Page:

http://www.jeppesn.dk/utf-8.html

17 years later and the website is still there...

krallja · on Oct 17, 2021

PHP+MySQL still hasn’t figured this stuff out? I remember fighting mojibake in FogBugz 6 for Linux/Mac, fifteen years ago!

marcan_42 · on Oct 17, 2021

Only recently did MySQL start defaulting to proper Unicode support, and it's still not the default everywhere in the stack nor on all OSes.

I did a mysqldump/restore last year with default options and it ate all emoji in the database. This is because MySQL's "utf-8" isn't real UTF-8, it's "utf8mb3" which does not support non-BMP characters. You have to explicitly ask for utf8mb4 to get sane behavior. In my case the database was configured correctly, but mysqldump still defaulted to utf8mb3. I found out that day all my database backups had been subtly corrupted forever.

This is one reason I use PostgreSQL everywhere I can get away with it. It has infinitely saner tooling. But WordPress insists on being married to MySQL...

KronisLV · on Oct 17, 2021

> You have to explicitly ask for utf8mb4 to get sane behavior.

Here's more information about the difference between utf8 and utfmb4 in MySQL, in case anyone is wondering: https://stackoverflow.com/a/30074553

I'd say that in most modern circumstances you probably should just go with utfmb4, unless you have a good reason for doing otherwise (apart from only ever needing to use the BMP, or having some odd storage restrictions).

Can anyone come up with an actual good reason to use utf8mb3?

bombcar · on Oct 17, 2021

The easiest test for utf8mb4 is try to round-trip a emoji smiley. UTF real handles it, the fake version doesn’t.

kalleboo · on Oct 17, 2021

In this case his problem likely stems from when the install was set up in 2007, 14 years ago. Back then having these charset mismatches in LAMP were super common

These days the defaults tend to be slightly more sane. PHP is still messed up with the various mb_ string functions though.

zerocrates · on Oct 17, 2021

This specific problem used to be extremely common.

You set up the database, you know you want Unicode, so you set it to use utf8 (we'll ignore for now MySQL's legacy BMP issue), your tables, your columns are using it: great.

You run your site for a while and everything is fine: characters all display fine, though you do occasionally notice some searches with non-ascii characters don't work quite as you expect, and one day you use an admin tool and you see things are messed up...

What you didn't realize was that you set up all that stuff but not the connection charset, which used to default to latin1. So MySQL sees your UTF-8 data coming in, thinks "this is latin1" because that's what the connection charset is set to, sees your tables want utf8, and happily "converts" your UTF-8 text to UTF-8. You don't notice it because it does the same thing in reverse when sending results back to you, so you think everything's fine.

Things got wise to setting the connection charset, avoiding this problem, and eventually the defaults also changed. So it's been pretty uncommon for new databases for a while now.

But, you still see people in this poster's situation, discovering their database is stored in this mis-encoded form when they move to new servers or versions of PHP that changed the default connection charset.

It's fairly easy to fix the data so it's stored correctly and you don't need the incorrect latin1 parameter to hide the problem, but just hope you didn't run a while without noticing the charset change so you now have a mix of good and bad encodings in the same tables.

marcosdumay · on Oct 17, 2021

Anything on Windows has problems with it, anything in Java has some difficulty, the Oracle DB has just adjusted and many people still don't know how to get the modern setting. The corporate world still didn't do the UTF-8 migration.

But I don't know about MySQL. What I do know is that PHP by itself doesn't have this problem anymore.

dn3500 · on Oct 17, 2021

It's figured out now. The problem is that when the database was created in 2007 someone specified the wrong character set.

xvilka · on Oct 17, 2021

I hope one day Microsoft will deprecate their non-unicode API, remove that TCHAR mess. I think they supported these legacy APIs long enough already. There's still a problem with PATH_MAX, see, for example, the dotnet MSBuild[1][2][3]. Maybe even switch to the UTF-8 instead.

[1] https://github.com/dotnet/msbuild/issues/53

[2] https://github.com/dotnet/msbuild/issues/5331

[3] https://github.com/dotnet/msbuild/issues/4200

CodesInChaos · on Oct 17, 2021

I hope one day Microsoft will remove support for arbitrary 16-bit code-units from their API and only accept valid UTF-16. That shouldn't break too many applications, but will finally allow people to use unicode strings to handle filenames.

Switching the default "ANSI" encoding to UTF-8 (or WTF-8) would be pretty nice as well, unless an application is running in compatibility mode. That should avoid most of the problems with the 8-bit API, without breaking everything.

And in the long run perhaps even switch to UTF-8 internally, making the 16-bit API legacy (one can dream).

j16sdiz · on Oct 17, 2021

This would upset many librarian and archivist

zerocrates · on Oct 17, 2021

A nice little clever way to fix this problem so things are actually stored correctly in the tables: run mysqldump and tell it you want to dump to latin1. It will do the same charset conversion that "fixes" the data here when setting the connection charset, so it will actually dump UTF-8 out. Then just edit the dump and change the SET NAMES line to "utf8mb4" instead of latin1. Loading in that dump should get you correctly-encoded UTF-8 text.

Then you can take the latin1 connection charset out of your app, or replace it with an explicit utf8/utf8mb4 one.

You can do a whole "convert to latin1, then binary, then back to utf8" dance live on the server too but this is cooler.

WalterGR · on Oct 17, 2021

If you read the follow-up comments, you saw that I was eventually able to resolve the problem (mostly) by...

It was refreshing to see this in the intro, rather than having to wade through the text. The conclusion of anything is important context while reading, and is useful if one is short on time.

lugged · on Oct 17, 2021

You might like, https://youtu.be/vtIzMaLkCaM

Something I wish more writers watched.