In MySQL, don’t use “utf8”, use “utf8mb4” (2016) (medium.com)
244 points by mariuz 5 months ago | 160 comments



Possibly my favourite thing about the rise of emoji is that they’re not in the Basic Multilingual Plane, and so stupid hacks like MySQL’s utf8 character set (I seriously don’t know why anyone ever thought it was a good idea in 2002–2004) are now obviously insufficient for full Unicode support, and enough people care about emoji to drive adoption of this basic level of Unicode support, and so non-English speakers are inadvertently helped.

The next hurdle is to convince people that extended grapheme clusters are the smallest divisible unit, and that just like slicing at the byte level is insufficient if you’re doing UTF-8, slicing at the code point level is insufficient. Basically, strings should just be treated as opaque data in almost all places.


While I’m contemplating why people might have thought utf8mb3 was wise in 2004, I’ll ask a similar question that I’d love to know the answer to, of events from my early childhood: why did UTF-16 gain popularity? I get UCS-2 existing, but once it became evident it wasn’t enough, why make UTF-16 and ruin Unicode with surrogate pairs (which still cause trouble today, leading to things like WTF-8 to make it possible for web browsers to abandon their UTF-16/UCS-2 mess in favour of almost-UTF-8)? UTF-8 was around by early 1993, while UTF-16 was developed several years later and seems fairly obviously the worst of both worlds—twice as large for normal text, ASCII-incompatible, and not even a fixed-width encoding, the only thing it almost had over UTF-8. The only thing it had going for it was that it was roughly compatible with UCS-2; but was UCS-2 that popular at the time?

(Concerning UCS-2, hindsight says it should have been obvious 65,536 code points was going to be insufficient, but I’ll overlook that as an easy mistake to make; that part is better documented than the rise of UTF-16.)

Possibly the most helpful description I have is from https://en.wikipedia.org/wiki/UTF-16#History, but even that doesn’t explain why it became popular. Why did Java adopt it, Windows, browsers, &c.? Why?


> While I’m contemplating why people might have thought utf8mb3 was wise in 2004

Because this is 2004 MySQL, they weren't exactly known for technical acumen and forward thinking. The root of their decision is that behind the scenes MySQL sizes index keys and various buffers assuming the maximum width of the character set, so "UTF8MB4" effectively reserves a fixed 4 bytes per codepoint.

And MySQL could not index columns larger than 767 bytes.

Guess what doesn't fit in 767 bytes? VARCHAR(255) in UTF8MB4. Anything above VARCHAR(191) would not be indexable.

Meanwhile VARCHAR(255) with 3 bytes per codepoint fits just right.

So rather than fix their crap, they just decided to limit their "UTF8" to the BMP and voilà no problem.

> The only thing it had going for it was that it was roughly compatible with UCS-2; but was UCS-2 that popular at the time?

The foundations of major systems (e.g. WinNT or Java) had long been laid out (in fact WinNT was released in 1993), and while UTF8 did exist there were few use cases for going beyond the BMP; and people had (and by and large still have) the misguided view that O(1) indexing into a string is a valuable property, and so their POV was that their choice was UCS2+surrogates (which few people would ever encounter) or UCS4.

> Possibly the most helpful description I have is from https://en.wikipedia.org/wiki/UTF-16#History, but even that doesn’t explain why it became popular. Why did Java adopt it, Windows, browsers, &c.? Why?

What they adopted was UCS2, this was "backronymed" to a bastard (non-transparent, you'd see every code unit) UTF16 later on. Java didn't even support getting codepoints out of strings until Java 5 in the mid aughts: https://docs.oracle.com/javase/8/docs/api/java/lang/String.h...

I'm pretty sure most string processing in Java and C# remains unaware of surrogate pairs and can corrupt proper UTF16 data rather interestingly. That is most definitely an extant issue in Javascript (not that properly handling that is even remotely sufficient to do actual unicode-aware string processing).


> Guess what doesn't fit in 767 bytes? VARCHAR(255) in UTF8MB4. Anything above VARCHAR(191) would not be indexable.

The utf8mb3 optimization was made before InnoDB was the default storage engine (2010).


It was widely used before it was the default.


Certainly, but let’s not rewrite history here. What innodb could/couldn’t do was not the motivation.


Why not? Even if it wasn't the default, it was an official bundled storage engine, and the cutoff just happens to fall right at its boundaries; that doesn't sound like a coincidence.


I like how you write this as if I'm not still running into these index column size issues today ;)


Windows NT 3.1 was released in July 1993. Early versions of Windows NT used UCS-2. At that time, UTF-8 was not very well known yet, and it was certainly too late to change all that code, and especially the APIs.

I'm not sure about Java and Javascript, but I think they were developed before it became clear to most people that UCS-2 was not going to be enough (which was around, or likely even sometime after the time that UTF-16 was published in 1996).

Windows NT, Java and Javascript became popular, and none of them could easily change the width of their standard character encoding, which was also used internally in all of their code.

The goal of UTF-8 was to have an 8-bit encoding compatible with ASCII; that it could also encode more than 65,536 code points was more like a lucky accident. Maybe there was already talk about the need to extend Unicode beyond 2^16 code points, but nothing had been standardized at that time.


I think folks who weren't there might not realize how big a deal the original Unicode standard was and how quickly technology vendors raced to adopt it. Windows NT, for example, must have been developed with Unicode in mind even before the standard was official. In those circumstances the broad adoption of UCS-2 was effectively inevitable.

Even if UTF-8 had been widely known, I don't think the eventual necessity for > 16-bit code points was clear enough to have given it a chance against the relatively easy architectural shift of replacing one fixed-length character encoding with another.


One factor is that back in the early 90s China was still an economically backward, closed Communist enclave barely more integrated into the global economy than North Korea is now. One of the deciding factors in having to decisively move beyond the BMP was China's adoption of GB 18030 as a mandatory standard in 2006[1].

[1]https://en.wikipedia.org/wiki/GB_18030


> One factor is that back in the early 90s China was still an economically backward, closed Communist enclave barely more integrated into the global economy

I'd disagree, China was well on the way to becoming an economic power house in the early 90's:

https://en.wikipedia.org/wiki/Five-year_plans_of_China#Eight...


But it wasn't yet. Decades later, the infrastructure stabilized (although in 2010 consumer network products were still incredibly chaotic).


It certainly wasn't the North Korean type of basket case simonh was accusing it of being.


I would disagree with the statement that UTF-16 gained popularity.

UTF-16 is mostly used where it provided an easy upgrade path from the 16 bit restricted UCS-2 to enabling full Unicode coverage.

Switching from UCS-2 to UTF-8 would have been a pain either because of dropping backward compatibility or ridiculous wrappers to get it working.

Microsoft was one of the first companies to implement Unicode, so they didn't have much of a choice.

Java could probably have switched to UTF-8 before 1.0 was released but I guess the supposed runtime overhead for not being able to access characters at indexes was feared too much. Of course, they lost that anyway with UTF-16.


Java 1.0 was made back in the days when "65K characters would be enough for anyone!" hence they chose UCS-2. Java only supported UTF-16 as of Java 5. I suspect that if they could turn back time, they would have chosen UTF-8 from the get-go, but, as you say, that was a battle lost long ago.


Java 9, released last year, changed the internal string representation so that it is no longer always UTF-16. Up until then, characters took at least 2 bytes in Java. Of course, the internal representation and what is written to IO streams are two different things. UTF-8 has been supported pretty much from day 1, along with a whole range of legacy encodings that used to be common.

Using UTF-8 instead of UTF-16 internally is mostly a memory optimization and does not actually break any APIs that I'm aware of (I'm sure there are corner cases, but I don't know of any issues here). Using UTF-16 always made some sense for non-Western scripts, where the most-used characters take two bytes instead of the three they'd need in UTF-8, versus just one byte for most Western scripts. But you are right, it was a bit of a regrettable choice.

If you consider that, e.g., Asian users mix Western script with their local script, it is a lot less clear cut. Throw things like HTML into the mix and you are looking at a lot of ASCII with bits of Asian script here and there.


With UTF-8 you give up any hope of O(1) string operations. With UTF-16 you can get O(1) string operations at the cost of doing them on code points rather than graphemes. If greater than 95% of the strings your language/program will ever see consist only of single-code-unit characters, that trade-off seems worth it.

I think a lot of people totally discount the enormous cost that Latin-1 users are paying for CJK (etc) support they don’t use. Maybe it’s the right call, but it isn’t obviously so.

Personally, at least for internal use, I like the idea of specialized implementations. At (immutable) string creation time decide whether to construct a Latin-1 string or a UTF-8 string (and possibly other fixed width encodings) depending on what code points are present. Expose the same set of operations (i.e. slicing on grapheme clusters) but with better performance characteristics where possible.

For storage at rest and especially communication with external programs I agree you probably should emit UTF-8.


Except that with utf-16 you still have to give up any hope of O(1) string operations, unless you don't care about accuracy. Only UTF-32 allows O(1) index based operations.

For example, 🧐 (aka face with monocle) needs 2 UTF-16 code units, since its code point is U+1F9D0, or, in UTF-8: F0 9F A7 90.

And this ignores the point that others have made about combining characters, such as the woman farmer emoji (Unicode: U+1F469 U+1F3FD U+200D U+1F33E, UTF-8: F0 9F 91 A9 F0 9F 8F BD E2 80 8D F0 9F 8C BE).

That beast needs 7 UTF-16 code units or 15 UTF-8 code units, i.e. 14 bytes vs 15 bytes. And this is still only one displayed character.


To clarify the woman farmer being 14 or 15 bytes: the example given is four different combined characters, and is additionally a "medium skin tone woman farmer".

A woman farmer requires the "woman" and "ear of rice" emoji, with a zero-width joiner character between them. To change the skin tone from the default yellow, a "medium skin tone" (type 4) modifier is added after the woman, but before the joiner.

So the sequence "U+1F469 U+1F3FD U+200D U+1F33E" represents "woman skin-tone-4 joiner ear-of-rice". And in the UTF-8 bytes, the first four bytes are "woman", the next four are "skin-tone-4", then three for "zero-width-joiner", and finally four for "ear-of-rice".

Emojipedia helpfully lists codepoints for each multi-character emoji: https://emojipedia.org/female-farmer/


In 2004 no one cared about face with monocle or woman farmer (I still don’t). The question was why did people pick UTF-16 back then. The answer is they were willing to make accuracy trade-offs for performance. Especially where they thought those trade-offs would only impact a small percentage of their users.


How exactly is Latin-1 privileged in this regard? Note its suffix - writing this from a region where other single-byte encodings proliferated in the 1990s, before sort-of settling to Latin-2 and/or Windows-1250 in the aughts (no, they're not the same, and the mapping is subtly broken).

If you mean "ASCII was good for our grandfathers", say so - but don't pretend that Latin-n somehow was not a bastardized set of hacks extended upon ASCII (like all "regional" encodings, single-byte or not), and shouldn't have died at least a decade ago: suddenly a wild Latin-10 string appears, and you're down to spinning up a conversion engine, praying that something didn't slip through the cracks (and inevitably something does, eventually). If your data never crosses a process boundary - by all means, keep it in Linear B, see if I care. Caveat: maintenance of your back-and-forth conversion engine will eat you alive, or iconv will obliterate the benefits of the "enormous" savings. (I also had the great idea "we'll just use a clever(tm) scheme and only have the necessary encoding parts," sometime around 2005: Memory and processing speed improvement was marginal, burden to keep juggling One Special Encoding (Latin-2 in my case) was not.)


> How exactly is Latin-1 privileged in this regard?

Latin-1 is sufficient for the languages spoken in countries representing more than 50% of global GDP. It might not be fair but I’m not sure this is a question of fairness to begin with.


Oh. In that case, you do mean ASCII, methinks (which also works), with Latin-1 you're shooting yourself in both feet. (Looks for link to "falsehoods programmers believe about encodings")


To my understanding, ASCII basically only covers English and Italian well, whereas Latin-1 covers most Western European languages — which includes the main languages of the Americas (Spanish, English, French, Portuguese).


Italian requires è, ù, ò, à, ì, é, ó, í and (rarely used) î

Accented letters just don't appear as prominently as in French or Spanish, and the distinction between acute and grave accents would nowadays often be lost even on native writers if it weren't for spellcheckers.


> To my understanding, ASCII basically only covers English and Italian well

And German, too: while German does have ä, ö, ü & ß, they may correctly be written as ae, oe, ue & ss.


Not really, if you can make a good living writing software for only the US/UK, France, Germany, Italy and a few more Western European countries, which is perfectly feasible. Latin1 works perfectly fine there, and if you wanted, and you didn't have to interact too much with other systems, you could just completely ignore all other encodings until UTF-8 started to get traction in the later 1990s. It even got into many RFCs as the default encoding until the RFC editors started requiring everyone to implement UTF-8 and default to it.


Oh, you could have done that in 1990s and well into the aughts, no doubt about that - that's exactly what happened :) I thought you proposed Latin-1 as useful today.


Oh no, no way. I'm not 100% sure about bradleyjg though... Maybe I misunderstood them.


I think Latin-1 compatible strings are common enough to be worth optimizing for with separate code paths. At least in large projects like OSes and programming languages. That doesn’t mean I think Unicode support should be omitted.


In such cases I would think the optimization is basically for ASCII, and then extend that to Latin1 because it happens to be the first 256 characters of Unicode, which means processing is trivial and you don't waste the other 128 byte values that way. But I figure that 99.9% of those strings would be ASCII-only.


> with Latin-1 you're shooting yourself in both feet

Can you be more specific about that? As I wrote in a sibling comment, I'm not convinced. I mean, nowadays you shouldn't use any of the ISO-8859-* encodings anymore of course, but we're talking about the 1990s here.


I beg to differ, considering that it doesn't contain the € sign.


The list of what it doesn't contain, with a 256 character space, is...long ;)


> With utf-16 you can get O(1) string operations at a cost of doing them on code points rather than graphemes.

No you cannot. UTF-16 is a variable-length encoding (up to two 16-bit code units per code point).

> I think a lot of people totally discount the enormous cost that Latin-1 users are paying for CJK (etc) support they don’t use.

If we limit ourselves to Latin 1, UTF-16 requires 100% more memory than UTF-8. How often do you need to randomly access a long string anyway?


Right, code units not code points. My mistake. Nonetheless that’s what you get back from e.g. java’s charAt.


You can do O(1) string operations in UTF-8 too, if you do them at the code unit level. It's just as wrong, but it's more obvious that it's wrong because it only works for ASCII instead of only working for BMP.


Bringing things full circle in this thread, this is absolutely why it is a big deal that emoji have become so popular and that people care about them, and that they are mostly homed in the Astral Plane. Now there's a giant corpus of UTF-16 data people are interacting with daily that absolutely makes it clear that you can't treat UTF-16 like UCS-2, and if you are still doing bad string operations in 2018 you have fewer excuses and more unhappy users ("why is my emoji broken?!").


Same thing I said above except even more so. What’s the most used language that can’t be represented with the BMP? Bengali is one possibility but AFAIK there’s a widely used Arabic form in use as well as the traditional script.


That's kind of what happened in Python 3.3. The internal representation is flexible, depending on the string contents. It may be just 1-byte Latin-1/ASCII, or it may be UTF-32 if needed: https://www.python.org/dev/peps/pep-0393/


Java 11 is going to a similar model.


Does anyone know if anything similar is planned for Javascript? It kills me that String.length (and [], codePointAt and so on) are living footguns. Though as other commenters have said, the rise of emoji has at least brought attention to the bugs caused by naive use of these tools.


> not Basic Multilingual Plane, and so stupid hacks like the utf8 character set (I seriously don’t know why anyone ever thought it was a good idea in 2004) are now obviously insufficient for full Unicode support

From [0]:

UTF-8 was designed with a much larger limit of 231 (2,147,483,648) code points (32,768 planes), and can encode 221 (2,097,152) code points (32 planes) even if limited to 4 bytes

What is wrong with UTF-8?

[0]: https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilin...


Nothing is wrong with UTF-8 itself; the problem is the "utf8" MySQL encoding.


ok - but parent referred to "utf8 character set". As if things weren't confusing enough already!


Can't reply to the sibling comment by Chris, but there isn't a "utf8 character set". There's just different encodings of a character set.

Even after hunting around after going to the 404 you pointed me to, I still get this:

utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.

utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.

So I don't think they are "different for SQL", whatever that means.

Sorry - not trying to sound snarky but terminology is important and this is an area people easily get tripped up in. Especially me.


Not sure why the reply link is sometimes missing, but in my experience if you open just that comment (via the “N minutes ago” link) the textbox for replying is there and works. Perhaps the link is omitted to discourage rapid conversation?

I fixed the link almost immediately—pesky inclusion of the dot in “.)”

---

The SQL specification (I cite SQL-92) speaks of character sets (and uses the keyword CHARACTER SET). That outside of SQL, Unicode is the character set which happens to have multiple encodings, is immaterial; in SQL, each Unicode encoding is a CHARACTER SET—`CHARACTER SET utf8`, `CHARACTER SET utf8mb4`, whatever. This is the sense in which I used the term “character set”, the SQL sense.

Before UCS-2 was proved insufficient, that encoding was actually called UNICODE, Wikipedia tells me.

Concerning MySQL, utf8mb3 is an alias for utf8 which was introduced after utf8. I’m guessing it was introduced simultaneously with utf8mb4.
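
Incidentally, MySQL's own catalogue presents these as two separate "character sets" with different maximum byte widths; on a 5.x server the output looks roughly like this:

    mysql> SHOW CHARACTER SET LIKE 'utf8%';
    +---------+---------------+--------------------+--------+
    | Charset | Description   | Default collation  | Maxlen |
    +---------+---------------+--------------------+--------+
    | utf8    | UTF-8 Unicode | utf8_general_ci    |      3 |
    | utf8mb4 | UTF-8 Unicode | utf8mb4_general_ci |      4 |
    +---------+---------------+--------------------+--------+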


> utf8mb3 is an alias for utf8 which was introduced after utf8

I didn't know this - and although I don't use MySQL much these days it good to know, thank you.

> in SQL, each Unicode encoding is a CHARACTER SET....

I see what you mean. That's pretty confusing...

> but in my experience if you open just that comment (via the “N minutes ago” link) the textbox for replying is there and works

It totally does!


I didn't know about the utf8mb3 name until today.


> Perhaps the link is omitted to discourage rapid conversation?

That's correct, in my understanding. And yes, clicking on "N minutes ago" is the way around it.


They’re called character sets in SQL; see, for example, https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8... .

I’ve clarified the original comment.


In case anyone is confused (since I know a lot of people see weird numbers and are just like, "Ugh, math hates me"), there's missing formatting in the above comment. It's supposed to read 2^31 and 2^21, not 231 and 221.


> and so stupid hacks like MySQL’s utf8 character set (I seriously don’t know why anyone ever thought it was a good idea in 2002–2004) are now obviously insufficient for full Unicode support

If I had to guess, they did it for space and performance reasons so that their benchmark results are a tiny bit better than those of other RDBMSs. They must've figured that their limited utf8 encoding was enough. And they reasoned the immediate benefits on benchmark tests outweighed future considerations.

I can't imagine any other reason for not implementing the utf-8 encoding completely and covering the unicode character set in its entirety.


Correct.


There’s nothing wrong with slicing at the byte level if you’re slicing with a delimiter like \n or “. UTF-8 sets high bits so continuation bytes never look like ASCII characters.


Combined emoji, such as 🇺🇸, are great for this as well! Multi-code-point sequences like these also help: separating them out, even on spaces, breaks the emoji. We are really making progress towards proper processing.


> Possibly my favourite thing about the rise of emoji is that they’re not Basic Multilingual Plane

When can we have user-defined glyphs/emoji, where SVG is embedded in the unicode string?


You’d just define two code points that switch into and out of SVG mode. (The second is only necessary for performance, so you can scan quickly without needing any sort of XML parser!)


What if a nested <text> element also contains an inline emosvg? :-)


Concerning the SVG table in OpenType fonts, https://docs.microsoft.com/en-au/typography/opentype/spec/sv... declares:

> any SVG <text> and <foreignObject> elements within a glyph description must be ignored and not rendered

The same limitations would be applied to this hypothetical SVG mode.


You can have that now, but it doesn't work the way you think. The SVG is embedded in the font, not the string. Demo: https://eosrei.github.io/emojione-color-font/full-demo.html

To create user-defined characters, you can occupy any code point in the reserved areas. https://en.wikipedia.org/wiki/Private_Use_Areas


That's nice, but I meant something else.

I was thinking: what if users of Whatsapp and other chat services could define their own emoji, as they type, and share them with other users? If a friend sends you a new emoji, you could simply copy it and use it in messages of your own. This functionality doesn't work with fixed codepoints.

Of course there are a million ways to do this (even with today's Unicode standard), but it would be nice if this kind of functionality were part of a standard, perhaps even at the level of Unicode.


That sounds like a cool idea, but a complete pain to implement.

There's been bugs that crash iPhones with combining Arabic, or freeze Android devices by spamming LtR override.

There's the Zalgo meme, where you drown out your words in combining diacritics.

Layout engines are already struggling to keep up with the standard, even without the ability to render a UINT_MAX x UINT_MAX canvas of infinitely repeating fractal penises.


It actually wouldn’t be as bad as you might imagine: pick whichever of the SVG-in-font specifications wins, and then all you need to do is apply that subset of SVG to the text, effectively allowing inline definition of the font. Roughly no new attack surface exposed in the text layout and rendering engine, given the existing ability to load untrusted fonts.

On further reflection, you would probably want some way to efficiently signal the dimensions of the SVG-mode glyph, so that the text layout engine doesn’t need the SVG parser, and only the renderer does. That would probably add at least one more code point reservation, depending on how you chose to do it.

But I’m only saying that it’s actually not that hard to do; I firmly believe it’s a bad idea, and I’m fairly confident it will never happen.


Hi! Former Product Manager for MySQL Server here. I have a blog post on this here: https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-ov...

The default for MySQL 8.0 (current GA) is utf8mb4. I also recommend reading this introduction post here: http://mysqlserverteam.com/sushi-beer-an-introduction-of-utf...


> An optimization was chosen to limit utf8 to 3 bytes

You chose not to define the UNICODE_32BIT macro that was in the original source from 2002?

* https://github.com/mysql/mysql-server/commit/55e0a9cb01af4b0...


The optimization was made at a different time in history. As I wrote in another comment here: while the decision is regrettable, because of a desire for backwards compatibility and simplified upgrade there wasn't an easy way out of it.

The utf8[mb4] support in MySQL 8.0 is much better than prior releases. The team worked really hard on eliminating the use cases for utf8mb3 (performance) so that "just use the default of utf8mb4" can work for everyone.


Is it really an optimization? I always assumed they picked 3 bytes to make sure a varchar(255) still would fit within the InnoDB field size limit of 768 bytes, so they could avoid/postpone having to fix InnoDB to support variable size fields.


This decision was made before InnoDB was the default.

While the performance differences are historical now, consider the case where, because of fixed-size buffers, one query could switch from a temp table in memory to a temp table on disk. The profile of the query has effectively changed.


Yes, but did it happen before InnoDB was widely used?


You haven't answered my question at all, which was whether this optimization (as you put it) was the choice not to define that macro.


Do you have a collation where no two different strings compare equal? Have you considered imposing an arbitrary order where you don't know which one "should" go first?


Yes - there is a binary collation.

w.r.t. arbitrary:

The ordering must be deterministic so that indexes can maintain sorted order. It's possible to create a custom collation to have your own ordering - but I would say most users don't do that.
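
For illustration (assuming a utf8mb4 connection): the binary collation compares code point by code point, so no two distinct strings ever compare equal, while the _ci collations fold case and more:

    mysql> SELECT 'a' = 'A' COLLATE utf8mb4_general_ci AS ci,
        ->        'a' = 'A' COLLATE utf8mb4_bin        AS bin;
    +----+-----+
    | ci | bin |
    +----+-----+
    |  1 |   0 |
    +----+-----+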


Sorry if I was unclear I meant some ordering that sorts by natural language but breaks ties in some arbitrary (but deterministic) manner, e.g. by binary.


Extremely important: If you use mysqldump, make sure it's also using utf8mb4. There is a very high chance it defaults to utf8 and silently corrupts your data.


Oh my god, this is both terrifying and could one day prove to be the most valuable comment I've seen on HN.

I love MySQL, but the fact that they didn't make utf8mb4 the default for mysqldump (instead of utf8), when they created utf8mb4 for database use... is one of the most irresponsible and negligent pieces of engineering I've seen in a long time, considering the amount of data worldwide backed up with mysqldump.

The fact that a program designed solely to back up databases would default to silent corruption like this is mind-boggling. At least they've fixed this in the future for MySQL 8.0.


Imagine how many companies out there are relying on backups that are already corrupted...

is your company one of them?


This is an important reason why testing your backups is a critical part of backing up.


It can be a very subtle issue, though. How would you recommend testing backups for rare corruption? Especially when live data is changing.

With this I had used a backup and nobody even noticed the problem for three weeks; thankfully I had some more direct backups too...


This is oft-repeated advice, but, in this context, it ends up being little more than a platitude.

Since the complaint here is of corruption that is silent, the level of testing required to catch it would be extraordinary.



Thanks, that would be:

mysqldump --default-character-set=utf8mb4 -u user database > dump.sql

right?


Yeah, that should work fine. It's easy to test, just output a row that has an emoji.
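
A smoke test along those lines might look like this (table and database names are made up; shell steps shown as comments, and the client itself should also be connected as utf8mb4):

    -- in the source database: store a row with an astral-plane character
    CREATE TABLE dump_check (id INT PRIMARY KEY, s VARCHAR(16) CHARACTER SET utf8mb4);
    INSERT INTO dump_check VALUES (1, '🎉');

    -- shell: mysqldump --default-character-set=utf8mb4 -u user src_db dump_check > dump.sql
    -- shell: mysql --default-character-set=utf8mb4 -u user dst_db < dump.sql

    -- in the restored copy: expect 1; a corrupted dump gives 0 or a truncated string
    SELECT s = '🎉' AS intact FROM dump_check WHERE id = 1;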


This bit me hard many years ago in possibly the most frustrating bug I've encountered--frustrating because of the stupidity of MySQL's implementation. I had a utf8 MySQL database that was created back before utf8mb4 existed--that is, back in the days when emojis were just getting popular, but not quite popular enough to be on software keyboards; and when only a few people had smartphones and iPads, but not everyone.

People would post long-form content to my site that took a long time for them to compose, and maybe once a month, someone would email me to complain that their long post was mysteriously cut short. Inspecting the database showed that the content was being stored there cut off, but with no rhyme or reason. Nothing in the error logs. 95% of other posts went through fine regardless of length or complexity. I put in tons of debugging code at various levels of the app to see where the data was being mis-parsed or truncated. Nothing!

The bug persisted for years without a hint. Finally, emojis became more popular, the bug started occurring more and more often, and MySQL released utf8mb4; and then it dawned on me: Occasionally, people would submit content that contained a Unicode emoji generated by their software keyboard, and MySQL would silently truncate the emoji and all the data after it WITHOUT WARNING when storing it in a utf8 column!! Silently!

I was so mad. A bug that plagued users for years, so much time and effort spent trying to debug it, and it was because of a combination of MySQL's dumb engineering decision to not support full Unicode for something called "utf8" and then silently truncating data without any warning.

I don't know what the lesson is here. But I do know it made me really, really mad at MySQL when I found out what the problem was.
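
For anyone who wants to see the failure mode for themselves, a minimal reproduction sketch (hypothetical table, old non-strict sql_mode):

    CREATE TABLE posts (body TEXT CHARACTER SET utf8);  -- i.e. the 3-byte utf8mb3
    SET SESSION sql_mode = '';                          -- the old, non-strict default
    INSERT INTO posts VALUES ('long post 🧐 everything after the emoji vanishes');
    -- Query OK, 1 row affected, 1 warning (Code 1366): Incorrect string value: '\xF0\x9F\xA7\x90...'
    SELECT body FROM posts;                             -- returns just 'long post '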


I already had substantial beef with MySQL and took a job I wanted despite the fact we were using it. Eight months later we are fixing exactly this bug.

In the world of production code issues most people rank data loss as the most severe or the second most severe category of error.

Designing this kind of crap into a database is just pissing on your users. It was the last straw. I can’t trust you if you work like this. I’m not working with any code written by these clowns again.


I think the lesson is "don't silently fail," and in that regard mysql has failed at learning that lesson over and over again to the point where the lesson is now, "don't use mysql" because what else can you do when dealing with such shitty software? This is far, far from the only stupid bug like this. Use mysql only if you don't care about preserving data. Period. There are many valid use cases that don't care about preserving data 100%, apparently most companies, so it's not a huge issue in reality. Hopefully the data that deals with monetary transactions won't get corrupted, but when it does, one cannot complain if they used mysql. It's that simple.


MySQL does not silently fail if strict sql_mode is in use, which has been the default since MySQL 5.7 (2015), and a recommended best practice for many years prior.

Yes, this really should have been the default from the start. It would have saved a lot of developers a lot of misery. But the past is the past. At least the default was finally fixed a few years back.

Many large companies using MySQL did proactively enable strict mode long before the default change, and they simply do not have any of these issues. That's a good thing considering the combined valuation of companies relying on MySQL is several trillion dollars. These companies do care about preserving data 100%, which is why strict sql_mode has existed as an option in MySQL for nearly 15 years now.


Does that affect MariaDB as well? That would only leave people with PostgreSQL or NoSQL databases like MongoDB, AFAIK. I've never been a fan of MySQL, but MariaDB always seemed to me like the spark of hope that was left in there.


Yes because it's built on the same codebase with the same design decisions and bugs. Maybe in the future it will diverge enough and fix such issues although compatibility with mysql might be affected.


> That would only leave people to PostgreSQL or NoSQL Databases like MongoDB AFAIK.

That's only true if you ignore the non-free options, which is what, I would argue, Postgres and Mysql were originally competing against in the first place.

> MariaDB always seemed to me like the spark of hope

Still, I'm not entirely sure what point you're trying to make. That is, what are you hoping for?


A friend of mine just recounted a story of when this bug killed all their MySQL instances across an entire enterprise installation because a user sent an email with the hamster face emoji in the title.


I worked at large company where an emoji or other 4-byte UTF character in a commit message would take down multiple systems simultaneously, including JIRA and various internal auditing and pipeline monitoring tools. It happened surprisingly rarely considering the lack of safeguard against the issue.


This happened to us too. We first noticed it in Bamboo with a failed build job. It's been a long time since I installed Bamboo, but if I recall correctly the installer adds its own schemata.


This is a fascinating tale. But the #1 takeaway by the author is dangerous: "Database systems have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding database systems."

This is correct as far as it goes, but it's poor advice. You may as well say "Software has subtle bugs and oddities, and you can avoid a lot of bugs by avoiding software." True! But not helpful! If you don't need a relational database, then yes, don't use one. But there's nothing specific about databases as a category that deserves a special callout. You can also avoid a lot of bugs by avoiding TLS, containers, threads, coroutines, blocking calls, asynchronous calls, networks, filesystems, operating systems, and computers.


With a slight difference: MySQL or whatnot is battle-tested through millions of systems worldwide, while your in-memory data structure that you save to disk.... :-)


Dangerous? How do you figure?

Many people assume a DB is the only way to store data. Now that’s dangerous.


Because filesystems don't contain oddities...


All storage systems have trade offs. What else is new?


I've "bookmarked" this: https://mathiasbynens.be/notes/mysql-utf8mb4

It digs a little bit deeper on the MySQL side (and it's from 2012 btw)


> If you need a database, don’t use MySQL or MariaDB. Use PostgreSQL.

This is stupid. There are pros and cons to every database. For example, MySQL allows you to set a trigger on an op, PostgreSQL requires you to write a function first.


Thing A having specific advantages over Thing B in no way implies that Thing B is not clearly superior overall.

I have used both MySQL and Postgres, although admittedly I haven't touched MySQL for a few years (the last time I tried to use MySQL for something, I found it would happily serve me up an arbitrary result for a clearly nonsensical query).

I thought this comment on an earlier thread summed up my feelings about the two systems rather beautifully: https://news.ycombinator.com/item?id=15173100


I agree that that point sounds too rough. I would write something like

>Every software has issues; so know your software's gotchas.

With that said, the perception I get is that PostgreSQL has fewer such gotchas and works better "out of the box". Though still not optimally, according to our Ops.


PostgreSQL is pretty good: the default config limits resource consumption to far less than you will likely want, and things like WAL shipping have some awkward gotchas, but it's really stable and quite consistent.

People usually argue against postgresql because it used to be slow on multi-core systems, the vacuum was kinda crappy in PGSQL <8.2 and it lacks a lot of features in base (because typically pgsql developers don't accept something in base unless it's very stable and orthogonal, which I consider to be a good thing).

It's like the OpenBSD of Databases.


> This is stupid.

Talk about stupid, run this query on both Mysql and Postgres:

select 0 = 'banana';

Make your own conclusion.


  mysql> select 0 = 'banana';
  +--------------+
  | 0 = 'banana' |
  +--------------+
  |            1 |
  +--------------+
  1 row in set, 1 warning (0.00 sec)
vs

  pg# select 0 = 'banana';
  ERROR:  invalid input syntax for integer: "banana"
  LINE 1: select 0 = 'banana';


My conclusion (not just based on this example) is that MySQL favors (initial) ease [1], whereas PostgreSQL favors correctness [2].

To suggest the former approach is "stupid" is to fail to explain how it became so wildly popular. I allege that it is actually brilliant.. marketing.

MySQL marketed itself to developers by being not just free, but easy to work with. Need a database? Want to learn SQL? We won't bore you or slow you down with Codd and those pesky rules maaaan. It'll just work!

For the vast majority of use cases, it did exactly that.

Meanwhile, Postgres was essentially sending the stodgy, old-fashioned message that databases are a hard problem that requires careful up-front consideration so that one doesn't run into problems down the road, be they scale/performance or more severe, like (even non-catastrophic) data loss.

That this message was, and still is, correct just isn't very compelling in a rapid-prototyping, fail-fast world. Unfortunately for those of us whose job it is to deal with the eventual consequences, it's too late by the time we're brought in [3].

I think we've seen a similar effect with many of the "NoSQL" datastores, as well. They gain initial popularity due to being easy, lightweight, and lightning fast, but, as they mature and gain real-world use, we see articles that are shocked (shocked!) that there is data loss going on in this establishment and that, yes, if you need certain tedious features (like data integrity) from that stodgy old database world, you'll have to sacrifice that lightning fast performance.

[1] aka flexibility

[2] aka standards adherence or even pedantry

[3] though I've never quite understood the reasoning of why it would be so impractical to move RDBMSes when there's so often an ORM and no MySQL-specific features in use


I've never had the privilege of using PostgreSQL but when using MySQL, I always enable automatic display of warnings (which really should be errors):

    mysql> \W
    Show warnings enabled.
    mysql> select 0 = 'banana';
    +--------------+
    | 0 = 'banana' |
    +--------------+
    |            1 |
    +--------------+
    1 row in set, 1 warning (0.00 sec)

    Warning (Code 1292): Truncated incorrect DOUBLE value: 'banana'


What is so stupid about avoiding the living hell of maintaining crap: silent data truncation, weird inconsistencies, or this very issue?

If you need a database, why would you ever pick the stupid one...


Wait, isn't 4 bytes per character also not enough to store UTF8? Can't each character be up to 8 bytes long, because the number of leading 1s in the first byte determines the length of the character?


According to the Wiki, the original UTF-8 proposal used up to 6 bytes.[1] But the Unicode standards people decided that 21 bits should be enough for everyone:

>Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, depending on the number of significant bits in the numerical value of the code point.

So, if they ever decide to use those extra two bytes, it will probably not be called UTF-8. Perhaps something like "UTF-8-MB-6".

[1] https://en.wikipedia.org/wiki/UTF-8#History


> Can't each character be up to 8 bytes long

I don't think so. Not to point you to the wiki again, but [0] is a pretty good description of how it works; you don't get to use all of the bits in the first byte to describe how long the character is in bytes.

[0]: https://en.wikipedia.org/wiki/UTF-8#Description


"An optimization was chosen to limit utf8 to 3 bytes, enough to handle almost all modern languages."

I can't for the life of me imagine how, when discussing this in a room full of competent developers, everyone just went "great idea, let's do this". Just adds to the long list of questionable MySQL engineering decisions.


I am pretty sure it was not because of any optimization but that the real reason was to make sure a VARCHAR(255) would still fit within the maximum field size of InnoDB (768 bytes) at the time.

Since then MySQL has added ROW_FORMAT=DYNAMIC to InnoDB which allows for larger values, but that probably required major changes to InnoDB.


There is usually a good reason for doing these kinds of things. But then don't call it UTF8 encoding! It doesn't do that. It stores the most popular characters out of a larger address space. Call it UTF8-MB3 maybe from the start and let the user decide if having VARCHAR(255) is worth the tradeoff. That's the trouble with MySQL: silently doing the wrong thing, and thus violating the principle of least surprise. There is software where this might be ok, but a database is not that kind of software.


In the case of MySQL (especially MySQL in 2003) bad design and silly behavior isn’t much of a surprise though :)


There are various buffers that were* not variable length in query execution (sorting, temp tables). It is not just about the on-disk format in this case.

* Use MySQL 8.0 :-)


Sure, in hindsight, 3 bytes is an abomination to us today. That was almost certainly not apparent to developers at the time.

MySQL added its three-byte utf8 support in MySQL 4.1, first beta release in April 2003 [1]. Meanwhile the UTF-8 standard originally permitted up to 6 bytes per char, until officially changed in November 2003 [2].

Unicode support in web software -- and even programming languages -- was still rapidly evolving in 2003. The MySQL team probably just made a bad guess along the lines of "the standard allows 6 bytes but that has terrible perf and memory implications for us, and most people only use 3 bytes, so let's limit to that".

While I've lost track of how many hundreds of times I've had to tell coworkers "use utf8mb4 and not utf8 in mysql", I don't begrudge the mysql dev team for not being clairvoyant in 2003!

[1] https://lists.mysql.com/announce/161

[2] https://en.wikipedia.org/wiki/UTF-8#History


Side note: It's not enough to change the encoding of the database, you need to make sure that all involved parties use the same encoding (or convert the charset at the interfaces).

For instance in a Laravel app I had to set the charset in the database configuration file "app/config/database.php".
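
In raw SQL terms that usually boils down to making sure the connection charset matches too, e.g. (or the equivalent charset option in your client library):

    SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;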


> I’ll make a sweeping statement here: all MySQL and MariaDB users who are currently using “utf8” should actually use “utf8mb4”. Nobody should ever use “utf8”.

I completely agree with this sentiment, although I recall being bitten by it in indexes. The article goes on to talk about how big a CHAR(1) is if it's utf8 vs utf8mb4. Well, the same applies to varchars, and so a VARCHAR(64) can require 64, 192 or 256 bytes, depending on the charset. This can bite you because MySQL limits index keys to, IIRC, 767 or so bytes. So I migrated some columns from utf8 to utf8mb4 and suddenly my compound indices stopped being used because the varchar contents could legally exceed that maximum. Something like that.


One troublesome thing about utf8mb4: if you have a VARCHAR(n) column, this means 4n bytes are required; this doesn’t mix particularly well with indexes, as you can only index columns of up to 767 bytes; VARCHAR(192) is thus 768 bytes, and too long to be indexed—so you’re limited to VARCHAR(191) or less.

This leads to things like django-celery flat-out not working on utf8mb4 without a (history-rewriting) patch, because it had a VARCHAR(255) primary key: https://github.com/celery/django-celery/issues/259.
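
Roughly what that looks like on an older InnoDB setup (COMPACT row format, 767-byte index key limit); numbers per the 4-bytes-per-character worst case:

    CREATE TABLE t (
      k VARCHAR(255) CHARACTER SET utf8mb4,
      UNIQUE KEY (k)
    ) ENGINE=InnoDB ROW_FORMAT=COMPACT;
    -- ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
    -- VARCHAR(191) still fits (191 * 4 = 764 bytes); VARCHAR(192) does not (768 bytes)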


This is supposedly fixed in databases created with MySQL 5.7 and up.

> MySQL 5.7 (2015) added some optimizations such as a variable length sort buffer, and also changed InnoDB’s default row format to DYNAMIC. This allows for indexes on VARCHAR(255) with utf8mb4; something that made migrations more difficult prior.

https://mysqlserverteam.com/mysql-8-0-when-to-use-utf8mb3-ov...
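
Roughly what the fixed behaviour looks like (a sketch; exact limits depend on version and settings):

    CREATE TABLE t (
      k VARCHAR(255) CHARACTER SET utf8mb4,
      UNIQUE KEY (k)
    ) ENGINE=InnoDB ROW_FORMAT=DYNAMIC;
    -- 255 * 4 = 1020 bytes, within the 3072-byte index key limit that applies to
    -- DYNAMIC/COMPRESSED rows (with innodb_large_prefix on pre-5.7 versions)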


then again, indexes on VARCHAR() can only really be used for equality and start matches.

Now I'm not saying it's impossible, but I have a feeling you're not likely going to do a lot of `where string_column =… ` or `where string_column like '…%'` where … is 192 characters in length.

If you need to do substring search or full text search, a traditional index won't help you anyways.


Yeah, string primary keys are the only place I expect people to actually hit this, as seen in the django-celery case.


Given that the article says that everyone should use utf8mb4, then why don't/haven't they just fixed it? Should be easy enough, and probably safe, too.


The desire for backwards compatibility/easier upgrades is why utf8mb4 could not just replace utf8:

The sorting order for the utf8 collations is not identical to utf8mb4 ones, and so an upgrade would require indexed columns to be rebuilt (remember, indexes are sorted).

I think with hindsight it would have been better to rip the bandaid off. 8.0 took a middle of the road approach by switching the default to utf8mb4.


They have. It's the default in MySQL 8.


Sorry, should have been more clear: why didn't they fix the bug in utf8 so that it is correct (and would do what utf8mb4 does now)? It's just that the obvious choice of 'utf8' has been plain wrong for a long time.


from the article:

>Once MySQL published this invalid character set, it could never fix it: that would force every user to rebuild every database.


This can't silently be changed, as the setting has an impact on the on-disk format for data and indexes and so on, and it would be confusing when using different server versions, e.g. in replication.


It is worth mentioning that if you're stuck with InnoDB + COMPACT + utf8mb4 then you should be using VARCHAR(191) instead of VARCHAR(255) for unique columns.

https://dev.mysql.com/doc/refman/5.7/en/innodb-row-format-an...


Currently leading a massive thorny project to convert all of our databases to utf8mb4. Definitely heed this warning.


I use utf8 for "normal" fields like name, address, etc. (I don't want a smiley as a first name), and use utf8mb4 for comments, for example.

But then I need to filter/validate each input (and reject it when there are 4-byte characters).


There are more characters outside BMP than just emoji.


Sorry for not doing my research but, aren't there 4-byte regular characters from other languages? Is it all emojis and symbols?


In case you want to convert your MySQL database tables from utf8 to utf8mb4 please remember that each character will now take 4, not 3 bytes. This means you can easily run out of limit for a row length (65536 bytes) if you have many string columns and your indexes will grow. Also, if a column is a part of a foreign key reference you might have to remove the foreign key first, upgrade columns on both sides and add the foreign key back.

So converting an existing database might not be as easy as it seems at first.
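
The conversion itself is usually a couple of statements per table plus one for the schema default (names here are made up; check row sizes, index lengths and foreign keys first, as noted above):

    ALTER TABLE posts CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
    ALTER DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;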


The article mentions it, but one of the massive frustrations with this issue is the huge amount of misinformation surrounding it (at least back in 2013), with many well-intended users on StackOverflow and the MySQL forums giving the exactly wrong advice on collation. Naming things, and not getting confused by bad names, is damned hard. This issue led me to quit MySQL for good, I can’t think of a more trivial issue that caused me such confusion and frustration as a software engineer.


Good reminder. I did read a well hidden article many years ago which highly recommended this. I believe I was hitting an obscure, arcane bug like the OP mentioned in the article, and after much digging managed to find this fix.

Since then, I always set every new MySQL database (and even double check the tables) to be 'utf8mb4' even long after the original lessons and pain have been forgotten.


> "Once MySQL published this invalid character set, it could never fix it: that would force every user to rebuild every database. MySQL finally released UTF-8 support in 2010, with a different name: “utf8mb4”."

Had MySQL defined UTF-8 as up to 6 bytes initially, would they later have forced users to rebuild their databases when UTF-8 was redefined to only 4 bytes?


> X have subtle bugs and oddities, and you can avoid a lot of bugs by avoiding X.

That doesn't sound like useful advice


Dear database developers, please don't get hung up on string lengths.

I'd be very happy to ask the DB for (say) a UTF-8 string of up-to 80 bytes. That means I could store 80 ASCII characters, 40 two-byte characters, 20 four-byte characters or some mixture as I wish.


At least PostgreSQL does not care about string length, other than as required by the SQL standard (e.g. it enforces varchar(20) by counting the number of code points). The maximum size of a string in PostgreSQL is 1 GB.


While I agree with you, as I understand it the SQL standard defines the N in varchar(N) as characters (not bytes).

There is a pathological case which is not common: a max-length value consisting entirely of 4-byte characters. Indexes are used to enforce features like unique constraints, so it's important that everything can fit in the buffer length.
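
You can see the characters-vs-bytes accounting directly (assuming a utf8mb4 connection):

    mysql> SELECT CHAR_LENGTH('🧐') AS chars, LENGTH('🧐') AS bytes;
    +-------+-------+
    | chars | bytes |
    +-------+-------+
    |     1 |     4 |
    +-------+-------+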


It's a code smell when what is obviously and intuitively stated, is in fact, incorrect.


Oh, I love these kinds of Unicode bugs. Also, please don't use UTF-16 in C#: although it advertises as using UTF-16, all of its built-in string processing actually works on code units, not Unicode code points, and breaks down on surrogate pairs.

Anyway, each time you start a new project that is supposed to work correctly with multi-language data, please set up evil strings test suite as early as you can and do not assume that any of the 3rd party software that you're using won't break down.


The previous discussion on HN for 'Naughty Strings' was certainly an entertaining and eye-opening read...

https://news.ycombinator.com/item?id=10035008


Anyone more knowledgeable than me care to explain the difference between utf8mb4_unicode_520_ci and utf8mb4_unicode_ci?


Different versions of the Unicode collation algorithm, which determines how strings are sorted. 520 is version 5.2.0; without the number, it's 4.0.0 by default.

https://dev.mysql.com/doc/refman/5.6/en/charset-collation-na...
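
You can list what your particular server ships (output abbreviated; the exact set varies by version):

    mysql> SHOW COLLATION WHERE Charset = 'utf8mb4' AND Collation LIKE '%unicode%';
    +------------------------+---------+-----+
    | Collation              | Charset | ... |
    +------------------------+---------+-----+
    | utf8mb4_unicode_ci     | utf8mb4 | ... |
    | utf8mb4_unicode_520_ci | utf8mb4 | ... |
    +------------------------+---------+-----+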


Thanks, but what are the implications, if any, of using either one?


Instead of MySQL, use PostgreSQL.


> In MySQL, never use “utf8”. Use “utf8mb4”.

I thought you shouldn't use MySQL in the first place.


[flagged]


Ridiculous. They fixed it 7 years ago.


No, the default was fixed in MySQL 8.0, which is about to be released soon. MySQL has improved a lot since Oracle bought it but it takes time to undo all the bad design choices.


I can confirm I ran into this issue this year with AWS Aurora MySQL, which definitely does the wrong thing by default. Basically the only reason we use MySQL at this point is that Aurora MySQL works fine with db.t2.small instances while PostgreSQL somehow requires bigger instances, and is thus way more expensive to run. These days, I definitely prefer PostgreSQL if I can get away with using it. I've also been eyeing CockroachDB for a while: their transactions are serializable by default and work in a clustered, multi-master, sharded setup, which is intriguing to me. Aurora updates still involve downtime because you always have a single point of failure in the form of an active master.


Counterpoint: emojis are bad and should be discouraged


The nice thing about emojis (maybe the only nice thing) is that they are so popular that they force companies and developers to implement proper support for all planes of Unicode, which means that all characters and symbols in those planes that are actually useful are then also supported.


It's not only for emoji, it's also for less common scripts.


I feel just the opposite: emojis are a lot more expressive than text and more universal in language than en-us.


What about winking emojis? I feel there's a lot more ambiguity/nuance with them, which can be a good thing, but can also express something you may or may not intend to.


Why?



