The next hurdle is to convince people that extended grapheme clusters are the smallest unit a string can safely be divided at: just as slicing at the byte level is insufficient if you're doing UTF-8, slicing at the code point level is insufficient too. Basically, strings should just be treated as opaque data in almost all places.
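A minimal Python sketch of why code point slicing breaks, using a combining accent (nothing beyond the standard library; the emoji sequences further down the thread behave the same way):

    s = "e\u0301"                  # "é" as two code points: e + combining acute accent
    print(len(s))                  # 2 code points, but one grapheme cluster
    print(s[:1])                   # "e" -- slicing at a code point dropped the accent
    print(len(s.encode("utf-8")))  # 3 bytes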
(Concerning UCS-2, hindsight says it should have been obvious 65,536 code points was going to be insufficient, but I’ll overlook that as an easy mistake to make; that part is better documented than the rise of UTF-16.)
Possibly the most helpful description I have is from https://en.wikipedia.org/wiki/UTF-16#History, but even that doesn’t explain why it became popular. Why did Java adopt it, Windows, browsers, &c.? Why?
Because this is 2004 MySQL, they weren't exactly known for technical acumen and forward thinking. The root of their decision is that behind the scenes MySQL sizes buffers and index keys for the worst case, effectively reserving a fixed number of bytes per code point; with "UTF8MB4" that's four bytes each.
And MySQL could not index columns larger than 767 bytes.
Guess what doesn't fit in 767 bytes? VARCHAR(255) in UTF8MB4. Anything above VARCHAR(191) would not be indexable.
Meanwhile VARCHAR(255) with 3 bytes per codepoint fits just right.
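The arithmetic is easy to check; a quick Python sketch (mbmaxlen is MySQL's term for a charset's maximum bytes per character):

    INDEX_LIMIT = 767  # bytes; InnoDB key limit for the old row formats

    for charset, mbmaxlen in [("latin1", 1), ("utf8", 3), ("utf8mb4", 4)]:
        print(charset, INDEX_LIMIT // mbmaxlen, "indexable characters")
    # latin1  767
    # utf8    255  <- exactly VARCHAR(255)
    # utf8mb4 191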
So rather than fix their crap, they just decided to limit their "UTF8" to the BMP and voilà no problem.
> The only thing it had going for it was that it was roughly compatible with UCS-2; but was UCS-2 that popular at the time?
The foundations of major systems (e.g. WinNT or Java) had long been laid (in fact WinNT was released in 1993), and while UTF-8 did exist, there were few use cases for going beyond the BMP. People had (and by and large still have) the misguided view that O(1) indexing into a string is a valuable property, so from their point of view the choice was UCS-2 plus surrogates (which few people would ever encounter) or UCS-4.
> Possibly the most helpful description I have is from https://en.wikipedia.org/wiki/UTF-16#History, but even that doesn’t explain why it became popular. Why did Java adopt it, Windows, browsers, &c.? Why?
What they adopted was UCS-2; this was "backronymed" into a bastard, non-transparent UTF-16 (you'd see every code unit) later on. Java didn't even support getting code points out of strings until Java 5 in the mid-aughts: https://docs.oracle.com/javase/8/docs/api/java/lang/String.h...
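You can approximate that non-transparent view in Python, which exposes UTF-16 code units if you ask for them explicitly:

    units = "\U0001F9D0".encode("utf-16-be")   # face with monocle, U+1F9D0
    print(len(units) // 2)                     # 2 code units for one code point
    print([hex(int.from_bytes(units[i:i+2], "big")) for i in (0, 2)])
    # ['0xd83e', '0xddd0'] -- the surrogate pair a UCS-2-era API hands you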
The utf8mb3 optimization was made before InnoDB was the default storage engine (2010).
The goal of UTF-8 was to have an 8-bit encoding compatible with ASCII; that it could also encode more than 65,536 code points was more of a lucky accident. Maybe there was already talk about the need to extend Unicode beyond 2^16 code points, but nothing had been standardized at that time.
Even if UTF-8 had been widely known, I don't think the eventual necessity for > 16-bit code points was clear enough to have given it a chance against the relatively easy architectural shift of replacing one fixed-length character encoding with another.
I'd disagree; China was well on the way to becoming an economic powerhouse in the early '90s.
UTF-16 is mostly used where it provided an easy upgrade path from 16-bit UCS-2 to full Unicode coverage.
Switching from UCS-2 to UTF-8 would have been a pain either because of dropping backward compatibility or ridiculous wrappers to get it working.
Microsoft was one of the first companies to implement Unicode, so they didn't have much of a choice.
Java could probably have switched to UTF-8 before 1.0 was released but I guess the supposed runtime overhead for not being able to access characters at indexes was feared too much. Of course, they lost that anyway with UTF-16.
Using UTF-8 instead of UTF-16 internally is mostly a memory optimization and does not actually break any APIs that I'm aware of (I'm sure there are corner cases, but I'm not aware of any issues here). Using UTF-16 always made some sense for non-Western scripts, where the most-used characters take two bytes instead of the one byte that suffices for most Western scripts. But you are right, it was a bit of a regrettable choice.
If you consider that, e.g., Asian users mix Western script with their local script, it is a lot less clear cut. Throw things like HTML into the mix and you are looking at a lot of ASCII with bits of Asian script here and there.
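A rough Python illustration of that mix; the markup string is just an invented example:

    s = '<p class="post">こんにちは世界</p>'   # CJK payload, ASCII-heavy markup
    print(len(s.encode("utf-8")))      # 41 bytes (CJK costs 3 each, ASCII 1)
    print(len(s.encode("utf-16-le")))  # 54 bytes (everything costs 2)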
I think a lot of people totally discount the enormous cost that Latin-1 users are paying for CJK (etc) support they don’t use. Maybe it’s the right call, but it isn’t obviously so.
Personally, at least for internal use, I like the idea of specialized implementations. At (immutable) string creation time decide whether to construct a Latin-1 string or a UTF-8 string (and possibly other fixed width encodings) depending on what code points are present. Expose the same set of operations (i.e. slicing on grapheme clusters) but with better performance characteristics where possible.
For storage at rest and especially communication with external programs I agree you probably should emit UTF-8.
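A sketch of that idea in Python; Java 9's "compact strings" apply the same trick internally (with Latin-1 vs UTF-16), though the class here is made up for illustration:

    class CompactStr:
        """Immutable string that picks the narrowest encoding at creation."""

        def __init__(self, s: str):
            try:
                self._enc = "latin-1"        # 1 byte per char when possible
                self._buf = s.encode(self._enc)
            except UnicodeEncodeError:
                self._enc = "utf-8"
                self._buf = s.encode(self._enc)

        def __str__(self) -> str:
            return self._buf.decode(self._enc)

        def byte_size(self) -> int:
            return len(self._buf)

    print(CompactStr("héllo").byte_size())   # 5 (Latin-1)
    print(CompactStr("h🧐llo").byte_size())  # 8 (UTF-8)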
For example, 🧐 (aka "face with monocle") needs two UTF-16 code units, since its code point is U+1F9D0; in UTF-8 it is F0 9F A7 90 (four bytes).
And this ignores the point that others have made about combining characters, such as
Unicode: U+1F469 U+1F3FD U+200D U+1F33E, UTF-8: F0 9F 91 A9 F0 9F 8F BD E2 80 8D F0 9F 8C BE
That beast needs 7 UTF-16 code units or 15 UTF-8 bytes, i.e. 14 bytes vs 15 bytes. And this is still only one displayed character.
A woman farmer requires the "woman" and "ear of rice" emoji, with a zero-width joiner character between them. To change the skin tone from the default yellow, a "medium skin tone" (type 4) modifier is added after the woman, but before the joiner.
So the sequence "U+1F469 U+1F3FD U+200D U+1F33E" represents "woman skin-tone-4 joiner ear-of-rice". And in the UTF-8 bytes, the first four bytes are "woman", the next four are "skin-tone-4", then three for "zero-width-joiner", and finally four for "ear-of-rice".
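The byte math is easy to verify in Python:

    s = "\U0001F469\U0001F3FD\u200D\U0001F33E"  # woman, skin-tone-4, ZWJ, ear of rice
    print(len(s))                      # 4 code points
    print(len(s.encode("utf-16-le")))  # 14 bytes = 7 UTF-16 code units
    print(len(s.encode("utf-8")))      # 15 bytes
    # ...and still a single character on screen.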
Emojipedia helpfully lists codepoints for each multi-character emoji:
If you mean "ASCII was good for our grandfathers", say so - but don't pretend that Latin-n somehow was not a bastardized set of hacks extended upon ASCII (like all "regional" encodings, single-byte or not), and shouldn't have died at least a decade ago: suddenly a wild Latin-10 string appears, and you're down to spinning up a conversion engine, praying that something didn't slip through the cracks (and inevitably something does, eventually). If your data never crosses a process boundary - by all means, keep it in Linear B, see if I care. Caveat: maintenance of your back-and-forth conversion engine will eat you alive, or iconv will obliterate the benefits of the "enormous" savings. (I also had the great idea "we'll just use a clever(tm) scheme and only have the necessary encoding parts," sometime around 2005: Memory and processing speed improvement was marginal, burden to keep juggling One Special Encoding (Latin-2 in my case) was not.)
Latin-1 is sufficient for the languages spoken in countries representing more than 50% of global GDP. It might not be fair, but I'm not sure this is a question of fairness to begin with.
Accented letters just don't appear as prominently as in French or Spanish, and the distinction between acute and grave accents would nowadays often be lost on native writers if it weren't for spellcheckers.
And German, too: while German does have ä, ö, ü & ß, they may correctly be written as ae, oe, ue & ss.
Can you be more specific about that? As I wrote in a sibling comment, I'm not convinced. I mean, nowadays you shouldn't use any of the ISO-8859-* encodings anymore of course, but we're talking about the 1990s here.
No you cannot. UTF-16 is a variable-length encoding (up to two 16-bit code units per code point).
> I think a lot of people totally discount the enormous cost that Latin-1 users are paying for CJK (etc) support they don’t use.
If we limit ourselves to Latin-1, UTF-16 requires up to 100% more memory than UTF-8. How often do you need to randomly access a long string anyway?
UTF-8 was designed with a much larger limit of 2^31 (2,147,483,648) code points (32,768 planes), and can encode 2^21 (2,097,152) code points (32 planes) even if limited to 4 bytes
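Those numbers fall out of the payload-bit arithmetic: an n-byte UTF-8 sequence carries (7 - n) bits in the lead byte plus 6 per continuation byte. A small Python check:

    def utf8_capacity(n_bytes: int) -> int:
        if n_bytes == 1:
            return 2 ** 7
        return 2 ** ((7 - n_bytes) + 6 * (n_bytes - 1))

    print(utf8_capacity(4))  # 2097152    = 2**21 (32 planes)
    print(utf8_capacity(6))  # 2147483648 = 2**31 (the original design limit)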
What is wrong with UTF-8?
Even after hunting around after going to the 404 you pointed me to, I still get this:
utf8mb4: A UTF-8 encoding of the Unicode character set using one to four bytes per character.
utf8mb3: A UTF-8 encoding of the Unicode character set using one to three bytes per character.
So I don't think they are "different for SQL", whatever that means.
Sorry - not trying to sound snarky but terminology is important and this is an area people easily get tripped up in. Especially me.
I fixed the link almost immediately—pesky inclusion of the dot in “.)”
The SQL specification (I cite SQL-92) speaks of character sets (and uses the keyword CHARACTER SET). That outside of SQL, Unicode is the character set which happens to have multiple encodings, is immaterial; in SQL, each Unicode encoding is a CHARACTER SET—`CHARACTER SET utf8`, `CHARACTER SET utf8mb4`, whatever. This is the sense in which I used the term “character set”, the SQL sense.
Before UCS-2 was proved insufficient, that encoding was actually called UNICODE, Wikipedia tells me.
Concerning MySQL, utf8mb3 is an alias for utf8 which was introduced after utf8. I’m guessing it was introduced simultaneously with utf8mb4.
I didn't know this, and although I don't use MySQL much these days it's good to know. Thank you.
> in SQL, each Unicode encoding is a CHARACTER SET....
I see what you mean. That's pretty confusing...
> but in my experience if you open just that comment (via the “N minutes ago” link) the textbox for replying is there and works
It totally does!
That's correct, in my understanding. And yes, clicking on "N minutes ago" is the way around it.
I’ve clarified the original comment.
If I had to guess, they did it for space and performance reasons, so that their benchmark results would be a tiny bit better than other RDBMSes'. They must've figured that their limited utf8 encoding was enough, and reasoned that the immediate benefits on benchmark tests outweighed future considerations.
I can't imagine any other reason for not implementing the UTF-8 encoding completely and covering the Unicode character set in its entirety.
When can we have user-defined glyphs/emoji, where SVG is embedded in the unicode string?
> any SVG <text> and <foreignObject> elements within a glyph description must be ignored and not rendered
The same limitations would be applied to this hypothetical SVG mode.
To create user-defined characters, you can occupy any code point in the reserved areas. https://en.wikipedia.org/wiki/Private_Use_Areas
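A tiny Python illustration; what a PUA code point means is purely a private agreement between sender and receiver (plus a font that renders it):

    custom = chr(0xE000)                   # first code point of the BMP PUA
    assert "\uE000" <= custom <= "\uF8FF"  # within the BMP Private Use Area
    print(f"my glyph: {custom}".encode("utf-8"))  # encodes like any other code point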
I was thinking: what if users of Whatsapp and other chat services could define their own emoji, as they type, and share them with other users? If a friend sends you a new emoji, you could simply copy it and use it in messages of your own. This functionality doesn't work with fixed codepoints.
Of course there's a million ways to do this (even with today's Unicode standard), but it would be nice if this kind of functionality were part of a standard, perhaps even at the level of Unicode.
There have been bugs that crash iPhones with combining Arabic, or freeze Android devices by spamming LTR override.
There's the Zalgo meme, where you drown out your words in combining diacritics.
Layout engines are already struggling to keep up with the standard, even without the ability to render a UINT_MAX x UINT_MAX canvas of infinitely repeating fractal penises.
On further reflection, you would probably want some way to efficiently signal the dimensions of the SVG-mode glyph, so that the text layout engine doesn’t need the SVG parser, and only the renderer does. That would probably add at least one more code point reservation, depending on how you chose to do it.
But I’m only saying that it’s actually not that hard to do; I firmly believe it’s a bad idea, and I’m fairly confident it will never happen.
The default for MySQL 8.0 (current GA) is utf8mb4. I also recommend reading this introduction post here: http://mysqlserverteam.com/sushi-beer-an-introduction-of-utf...
You chose not to define the UNICODE_32BIT macro that was in the original source from 2002?
The utf8[mb4] support in MySQL 8.0 is much better than prior releases. The team worked really hard on eliminating the use cases for utf8mb3 (performance) so that "just use the default of utf8mb4" can work for everyone.
While the performance differences are historical now, consider that because of fixed-size buffers, one query could switch from an in-memory temp table to an on-disk temp table. The profile of the query has effectively changed.
The ordering must be deterministic so that indexes can maintain sorted order. It's possible to create a custom collation to have your own ordering - but I would say most users don't do that.
I love MySQL, but the fact that they didn't make utf8mb4 the default for mysqldump (instead of utf8), when they created utf8mb4 for database use... is one of the most irresponsible and negligent pieces of engineering I've seen in a long time, considering the amount of data worldwide backed up with mysqldump.
The fact that a program designed solely to back up databases would default to silent corruption like this is mind-boggling. At least they've fixed this going forward, in MySQL 8.0.
is your company one of them?
I once restored from such a backup and nobody even noticed the problem for three weeks; thankfully I had some more direct backups too...
Since the complaint here is of corruption that is silent, the level of testing required to catch it would be extraordinary.
mysqldump --default-character-set=utf8mb4 -u user database > dump.sql
People would post long-form content to my site that took a long time for them to compose, and maybe once a month, someone would email me to complain that their long post was mysteriously cut short. Inspecting the database showed that the content was being stored there cut off, but with no rhyme or reason. Nothing in the error logs. 95% of other posts went through fine regardless of length or complexity. I put in tons of debugging code at various levels of the app to see where the data was being mis-parsed or truncated. Nothing!
The bug persisted for years without a hint. Finally, emojis became more popular, the bug started occurring more and more often, and MySQL released utf8mb4; and then it dawned on me: Occasionally, people would submit content that contained a Unicode emoji generated by their software keyboard, and MySQL would silently truncate the emoji and all the data after it WITHOUT WARNING when storing it in a utf8 column!! Silently!
I was so mad. A bug that plagued users for years, so much time and effort spent trying to debug it, and it was because of a combination of MySQL's dumb engineering decision to not support full Unicode for something called "utf8" and then silently truncating data without any warning.
I don't know what the lesson is here. But I do know it made me really, really mad at MySQL when I found out what the problem was.
In the world of production code issues most people rank data loss as the most severe or the second most severe category of error.
Designing this kind of crap into a database is just pissing on your users. It was the last straw. I can’t trust you if you work like this. I’m not working with any code written by these clowns again.
Yes, this really should have been the default from the start. It would have saved a lot of developers a lot of misery. But the past is the past. At least the default was finally fixed a few years back.
Many large companies using MySQL did proactively enable strict mode long before the default change, and they simply do not have any of these issues. That's a good thing considering the combined valuation of companies relying on MySQL is several trillion dollars. These companies do care about preserving data 100%, which is why strict sql_mode has existed as an option in MySQL for nearly 15 years now.
That's only true if you ignore the non-free options, which is what, I would argue, Postgres and Mysql were originally competing against in the first place.
> MariaDB always seemed to me like the spark of hope
Still, I'm not entirely sure what point you're trying to make. That is, what are you hoping for?
This is correct as far as it goes, but it's poor advice. You may as well say "Software has subtle bugs and oddities, and you can avoid a lot of bugs by avoiding software." True! But not helpful! If you don't need a relational database, then yes, don't use one. But there's nothing specific about databases as a category that deserves a special callout. You can also avoid a lot of bugs by avoiding TLS, containers, threads, coroutines, blocking calls, asynchronous calls, networks, filesystems, operating systems, and computers.
Many people assume a DB is the only way to store data. Now that’s dangerous.
It digs a little bit deeper on the MySQL side (and it's from 2012 btw)
This is stupid. There are pros and cons to every database. For example, MySQL lets you define a trigger body inline, while PostgreSQL requires you to write a trigger function first.
I have used both MySQL and Postgres, although admittedly I haven't touched MySQL for a few years (the last time I tried to use MySQL for something, I found it would happily serve me up an arbitrary result for a clearly nonsensical query).
I thought this comment on an earlier thread summed up my feelings about the two systems rather beautifully: https://news.ycombinator.com/item?id=15173100
>Every software has issues; so know your software's gotchas.
With that said, the perception I get is that PostgreSQL has fewer such gotchas and works better "out of the box". Though still not optimally, according to our Ops.
People usually argue against PostgreSQL because it used to be slow on multi-core systems, vacuum was kinda crappy before 8.2, and it lacks a lot of features in the base install (because pgsql developers typically don't accept something into base unless it's very stable and orthogonal, which I consider a good thing).
It's like the OpenBSD of Databases.
Talk about stupid, run this query on both MySQL and Postgres:
select 0 = 'banana';
Make your own conclusion.
mysql> select 0 = 'banana';
+--------------+
| 0 = 'banana' |
+--------------+
|            1 |
+--------------+
1 row in set, 1 warning (0.00 sec)
pg# select 0 = 'banana';
ERROR: invalid input syntax for integer: "banana"
LINE 1: select 0 = 'banana';
To suggest the former approach is "stupid" is to fail to explain how it became so wildly popular. I allege that it is actually brilliant... marketing.
MySQL marketed itself to developers by being not just free, but easy to work with. Need a database? Want to learn SQL? We won't bore you or slow you down with Codd and those pesky rules maaaan. It'll just work!
For the vast majority of use cases, it did exactly that.
Meanwhile, Postgres was essentially sending the stodgy, old-fashioned message that databases are a hard problem that require careful up-front consideration so that one doesn't run into problems down the road, be they scale/performance or more severe like (even non-catastrophic) data loss.
That this message was, and still is, correct, just isn't very compelling in a rapid-prototyping fail-fast world. Unfortunately for those of us whose job it is to deal with the eventual consequences, it's too late by the time we're brought in.
I think we've seen a similar effect with many of the "NoSQL" datastores, as well. They gain initial popularity due to being easy, lightweight, and lightning fast, but, as they mature and gain real-world use, we see articles that are shocked (shocked!) that there is data loss going on in this establishment and that, yes, if you need certain tedious features (like data integrity) from that stodgy old database world, you'll have to sacrifice that lightning fast performance.
* aka flexibility
* aka standards adherence or even pedantry
* though I've never quite understood the reasoning of why it would be so impractical to move RDBMSes when there's so often an ORM and no MySQL-specific features in use
Show warnings enabled.
mysql> select 0 = 'banana';
+--------------+
| 0 = 'banana' |
+--------------+
|            1 |
+--------------+
1 row in set, 1 warning (0.00 sec)

Warning (Code 1292): Truncated incorrect DOUBLE value: 'banana'
If you need a database, why would you ever pick the stupid one...
>Since the restriction of the Unicode code-space to 21-bit values in 2003, UTF-8 is defined to encode code points in one to four bytes, depending on the number of significant bits in the numerical value of the code point.
So, if they ever decide to use those extra two bytes, it will probably not be called UTF-8. Perhaps something like "UTF-8-MB-6".
I don't think so. Not to point you to the wiki again, but its description of how the encoding works is pretty good: you don't get to use all of the bits in the first byte to describe how long the character is in bytes.
I can't for the life of me imagine how, when discussing this in a room full of competent developers, everyone just went "great idea, lets do this". Just adds to the long list of questionable MySQL engineering decisions.
Since then MySQL has added ROW_FORMAT=DYNAMIC to InnoDB which allows for larger values, but that probably required major changes to InnoDB.
* Use MySQL 8.0 :-)
MySQL added its three-byte utf8 support in MySQL 4.1, first beta released in April 2003. Meanwhile the UTF-8 standard originally permitted up to 6 bytes per character, until that was officially changed in November 2003 (RFC 3629).
Unicode support in web software -- and even programming languages -- was still rapidly evolving in 2003. The MySQL team probably just made a bad guess along the lines of "the standard allows 6 bytes but that has terrible perf and memory implications for us, and most people only use 3 bytes, so let's limit to that".
While I've lost track of how many hundreds of times I've had to tell coworkers "use utf8mb4 and not utf8 in mysql", I don't begrudge the mysql dev team for not being clairvoyant in 2003!
For instance in a Laravel app I had to set the charset in the database configuration file "app/config/database.php".
I completely agree with this sentiment, although I recall being bitten by it in indexes. The article goes on to talk about how big a CHAR(1) is if it's utf8 vs utf8mb4. Well, the same applies to varchars, and so a VARCHAR(64) potentially requires 64, 192 or 256 bytes, depending on the charset. This can bite you because MySQL limits the longest prefix usable for index scans to, iirc, 767 or so bytes. So I migrated some columns from utf8 to utf8mb4 and suddenly my compound indices stopped being used, because the varchar contents could legally exceed that maximum. Something like that.
This leads to things like django-celery flat-out not working on utf8mb4 without a (history-rewriting) patch, because it had a VARCHAR(255) primary key: https://github.com/celery/django-celery/issues/259.
> MySQL 5.7 (2015) added some optimizations such as a variable length sort buffer, and also changed InnoDB’s default row format to DYNAMIC. This allows for indexes on VARCHAR(255) with utf8mb4; something that made migrations more difficult prior.
Now I'm not saying it's impossible, but I have a feeling you're not likely going to do a lot of `where string_column =… ` or `where string_column like '…%'` where … is 192 characters in length.
If you need to do substring search or full text search, a traditional index won't help you anyways.
The sorting order for the utf8 collations is not identical to utf8mb4 ones, and so an upgrade would require indexed columns to be rebuilt (remember, indexes are sorted).
I think with hindsight it would have been better to rip the bandaid off. 8.0 took a middle of the road approach by switching the default to utf8mb4.
>Once MySQL published this invalid character set, it could never fix it: that would force every user to rebuild every database.
But I need to filter/validate each input (and refuse it when there are 4-byte characters).
So converting an existing database might not be as easy as it seems at first.
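A minimal sketch of that validation in Python; characters outside the BMP are exactly the ones that need 4 bytes in UTF-8 and therefore don't fit in MySQL's old utf8:

    def has_supplementary(s: str) -> bool:
        # True if any character lies outside the BMP (> U+FFFF)
        return any(ord(c) > 0xFFFF for c in s)

    assert not has_supplementary("héllo")
    assert has_supplementary("nice \U0001F9D0")  # face with monocle, U+1F9D0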
Since then, I always set every new MySQL database (and even double check the tables) to be 'utf8mb4' even long after the original lessons and pain have been forgotten.
Had MySQL defined UTF-8 as up to 6 bytes initially, would they later have forced users to rebuild their databases when UTF-8 was redefined to only 4 bytes?
That doesn't sound like useful advice
I'd be very happy to ask the DB for (say) a UTF-8 string of up-to 80 bytes. That means I could store 80 ASCII characters, 40 two-byte characters, 20 four-byte characters or some mixture as I wish.
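That byte-budget check is a one-liner in Python; the 80-byte limit is just the figure from the example above:

    def fits(s: str, max_bytes: int = 80) -> bool:
        return len(s.encode("utf-8")) <= max_bytes

    assert fits("a" * 80)           # 80 one-byte characters
    assert fits("\u00e9" * 40)      # 40 two-byte characters
    assert fits("\U0001F9D0" * 20)  # 20 four-byte characters
    assert not fits("\U0001F9D0" * 21)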
There is a pathological case which is not common: a value at max length consisting entirely of 4-byte characters. Indexes are used to enforce features like unique constraints, so it's important that everything can fit in the buffer length.
Anyway, each time you start a new project that is supposed to work correctly with multi-language data, please set up an evil-strings test suite as early as you can, and do not assume that any of the 3rd-party software you're using won't break down.
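Something along these lines, say in Python; the strings are illustrative picks (ready-made collections like the "Big List of Naughty Strings" exist), and store/load stand in for whatever persistence layer your app has:

    EVIL_STRINGS = [
        "\U0001F9D0",                            # astral-plane emoji (4-byte UTF-8)
        "\U0001F469\U0001F3FD\u200D\U0001F33E",  # ZWJ sequence (woman farmer)
        "e\u0301",                               # combining accent
        "\u202Eevil\u202C",                      # RTL override + pop
        "A" * 10000,                             # very long input
    ]

    def check_roundtrip(store, load):
        for s in EVIL_STRINGS:
            store("k", s)
            assert load("k") == s, "corrupted: %r" % s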
I thought you shouldn't use MySQL in the first place.