
UTF-8 was not restricted to U+10FFFF until November 2003 (RFC 3629). Prior to that, implementers had to assume a single codepoint could take up to 6 bytes to encode.

The "utf8" charset in MySQL can only store up to 3 bytes per character (codepoints up to 16 bits), which falls short of the 4 bytes (codepoints up to 21 bits) that UTF-8 needs.

"utf8mb4" correctly encodes all UTF-8 codepoints by current standards.
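
To make the byte counts concrete, here is a rough Python sketch of the current (post-2003) UTF-8 length rules. The helper names `utf8_len` and `fits_utf8mb3` are made up for illustration and are not anything MySQL exposes; the second one only approximates the 3-byte limit described above.

```python
def utf8_len(cp: int) -> int:
    """Number of bytes UTF-8 (as restricted to U+10FFFF) uses for code point cp."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates are not encodable in UTF-8")
    if cp < 0x80:
        return 1  # 7 bits
    if cp < 0x800:
        return 2  # 11 bits
    if cp < 0x10000:
        return 3  # 16 bits: the most a 3-byte sequence can carry
    return 4      # up to 21 bits


def fits_utf8mb3(text: str) -> bool:
    """Hypothetical check: True if every character needs at most 3 bytes."""
    return all(utf8_len(ord(ch)) <= 3 for ch in text)


if __name__ == "__main__":
    for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
        print(f"U+{ord(ch):04X} needs {utf8_len(ord(ch))} byte(s)")
    print(fits_utf8mb3("h\u00e9llo \u20ac"))   # everything here is in the 16-bit range
    print(fits_utf8mb3("hi \U0001F600"))       # the emoji needs a 4-byte sequence
```

Anything at or above U+10000 (emoji, many CJK extension characters) lands in the 4-byte branch, which is exactly what a 3-byte column cannot hold.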




> "utf8mb4" correctly encodes all UTF-8 codepoints by current standards.

Yes, all good and known, but not really what the misunderstanding is about.

"utf8" follows the definitions of UTF-8 as they stood at the time, adhering to Unicode's former code space limit.


The utf8mb3 limitation that MySQL chose for "utf8" had already been obsolete for over seven years at the time it was adopted. It would already have been obsolete in 1996.



