> Except in Safari, whose maxlength implementation seems to treat all emoji as length 1. This means that the maxlength attribute is not fully interoperable between browsers.
No, it's definitely not. You can read the byte length more directly in JS and use that to decide whether more text should be allowed.
    const encoder = new TextEncoder();
    const currentBytes = encoder.encode(inputStr).byteLength;
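Something like this sketch can then gate further typing (assuming an `input` element and a hypothetical MAX_BYTES limit; deletions and selection replacement are ignored for brevity):

    // Sketch: MAX_BYTES is a hypothetical limit, `input` an assumed <input> element.
    const MAX_BYTES = 255;
    input.addEventListener('beforeinput', (event) => {
      // event.data is null for deletions, which are always allowed here
      const next = input.value + (event.data ?? '');
      if (encoder.encode(next).byteLength > MAX_BYTES) {
        event.preventDefault(); // the edit would blow the byte budget
      }
    });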
But the maxlength attribute is at best an approximation. Don't rely on it for things like limiting length for database fields (not that you should trust the client anyway).
Apple's take seems more reasonable. When a user uses an emoji, they think of it as a single symbol; they don't care about the Unicode implementation or its length in bytes. IMO this should be the standard, and all other interpretations are a repeat of the transition from ASCII to Unicode.
Agreed, I'm surprised the linked OP filed a bug for WebKit when I'd say it's the only correct implementation.
HTML is for UIs, and I don't think many users would expect an emoji to be counted as 11 symbols when it looks like just one. If this weren't an emoji but a symbol from a more complex writing system, splitting it up would clearly be unacceptable, so I don't see why one should make an exception here just because this one known example of a family emoji still mostly survives the truncation.
Indeed Swift is one of the very few languages that even have a stdlib API for working with this intuitive definition of a "character" (the technical Unicode term is "(extended) grapheme cluster" and the division of text into EGCs is called Unicode text segmentation, described by [uax29] – spoiler: definitely not trivial).
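JavaScript has since grown a counterpart in Intl.Segmenter (available in most modern engines), so a rough sketch of counting these "intuitive characters" looks like:

    // Count extended grapheme clusters, not code units or code points.
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    const countGraphemes = (str) => [...segmenter.segment(str)].length;

    countGraphemes('e\u0301');            // 1: 'e' + combining acute accent
    countGraphemes('\u{1F1FA}\u{1F1F8}'); // 1: a regional-indicator pair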
The number of graphemes an emoji displays as depends on the platform and font. How many "characters" does Safari think Ninja Cat[1] is? It displays as a single grapheme on Windows 10.
What I recall of the standard algorithm is that it does include font/user-agent/locale pieces in its calculations and "zero ambiguity" is a bit of a stretch, unfortunately.
Even the overview includes this note:
> Note: Font-based information may be required to determine the appropriate unit to use for UI purposes
First of all, grapheme counting depends on normalization and there's like 6 Unicode normalization algorithms to consider, depending on locale and intended use/display style.
Keep in mind that grapheme counting includes things like natural ligatures, which English has always had a rough time counting. Things like `fi` and `ft` sometimes form single visual glyphs from the font's perspective in some serif fonts, but are always supposed to be counted as two graphemes from the user's perspective.
Relatedly, one of the simpler parts of the Unicode standard algorithm is that a ZWJ always signals a merged grapheme cluster, so per the standard algorithm the Ninja Cat is always a single grapheme. The non-Microsoft fonts that don't include Cat+ZWJ+Job sequences ("cats with jobs" emojis) fall back to two display characters to try to get the idea across, but the spec says they should still count that as only a single grapheme even when not presented as such.
(ETA Aside: I think the cats-with-jobs emojis are great and fun and should be more widely adopted outside of just Microsoft. Why should only the people emojis have jobs?)
> First of all, grapheme counting depends on normalization and there's like 6 Unicode normalization algorithms to consider, depending on locale and intended use/display style.
It says "The boundary specifications are stated in terms of text normalized according to Normalization Form NFD (see Unicode Standard Annex #15, “Unicode Normalization Forms” [UAX15]). In practice, normalization of the input is not required."
And there's also "Even in Normalization Form NFC, a syllable block may contain a precomposed Hangul syllable in the middle."
So are you sure normalization matters? I'm excluding the K normalizations since they're not for general use and alter a bunch of characters on purpose. But also, that's only NFC/NFD/NFKC/NFKD; are there two others, or was "six" just a misremembering?
Nobody that cares about keeping text intact uses K. And you don't have to care about how they count characters. It won't affect your system, and they're probably counting something dumb.
NFC/NFD will give you the same graphemes.
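That's easy to sanity-check with a sketch (Intl.Segmenter implements the default UAX #29 segmentation):

    const seg = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    const nfc = 'e\u0301'.normalize('NFC'); // precomposed U+00E9
    const nfd = 'e\u0301'.normalize('NFD'); // 'e' followed by U+0301
    [...seg.segment(nfc)].length; // 1
    [...seg.segment(nfd)].length; // 1 -- same boundaries either way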
Or to put it another way, there's only two real normalizations, and there's some weird junk off to the side.
And nope, it doesn't depend at all on font, user agent, or locale. Nor does it depend on normalization (that's explicitly stated). And like you say, a ligature for 'fi' is two graphemes. All of this is for the purposes of text selection, counting characters, etc.
You're right there are additional considerations in rendering/UX, particularly with font ligatures. But that operates at a different level from graphemes.
(The spec also explains that more sophisticated grapheme definitions can be constructed for locales. But "The Unicode definitions of grapheme clusters are defaults: not meant to exclude the use of more sophisticated definitions of tailored grapheme clusters where appropriate." So it works as a single shared canonical reference.)
That's my point, though, the algorithm defines "defaults" and "starting points" then suggests places like font/user agent/locale that may need to make more sophisticated judgement calls. That's very different from "an unambiguous algorithm that applies in all circumstances".
I included this note from that very report:
> Note: Font-based information may be required to determine the appropriate unit to use for UI purposes, such as identification of boundaries for first-letter paragraph styling.
You fixated on this additional pull quote:
> The Unicode definitions of grapheme clusters are defaults: not meant to exclude the use of more sophisticated definitions of tailored grapheme clusters where appropriate. Such definitions may more precisely match the user expectations within individual languages for given processes.
I would add extra emphasis on the word default there, which the document elsewhere contrasts with tailored: the preferred terminology for distinguishing the "simple rules of thumb" this particular algorithm provides from user-focused, locale-dependent grapheme considerations (which still vary a lot).
For instance, this bit on tailored graphemes:
> Grapheme clusters can be tailored to meet further requirements. Such tailoring is permitted, but the possible rules are outside of the scope of this document. One example of such a tailoring would be for the aksaras, or orthographic syllables, used in many Indic scripts.
(And a later table of more examples of tailored graphemes that the "default grapheme" algorithm will not catch the subtleties of.)
As for relying on normalization, I may have misread this particular pull quote:
> A key feature of default Unicode grapheme clusters (both legacy and extended) is that they remain unchanged across all canonically equivalent forms of the underlying text. Thus the boundaries remain unchanged whether the text is in NFC or NFD.
I read that as implying that algorithm input needed to be NFC or NFD (though there is clearly no difference to the algorithm which of the two forms it is in), not that the algorithm should also just as well work on unnormalized inputs. Rereading it again, I'm still not sure if the algorithm is well-defined on unnormalized inputs, but I can see how it reads as a possible implication that it might work. (Maybe?)
I'm not saying Apple's implementation should be the gold standard, just that characters the Unicode specification allows to combine into one character should be counted as one character.
I understand that technical limitations will sometimes cause a combined character to be displayed as more than one character, as happens in Safari with that ninja cat, and in those cases it's preferable to decompose the emoji, or use a tofu character, than to change it for another.
They are decomposing the emoji, are they not? The resulting glyph (with three family members rather than four) is just the result of the ligatures between the remaining emojis.
It's ambiguous. At least on my system (X11), I can move the cursor through a 4-member family with one press of Left or Right, and can't put the cursor inside it, and I can delete it with one press of Delete. But it takes 5 presses of Backspace to delete it.
Emojis simply manifest existing bugs in programs. Those bugs would have existed anyway, as emojis are neither the only symbols beyond the BMP nor the only symbols composed of grapheme clusters.
Being much more frequent than other, more exotic characters, they just reveal bugs more often.
An incredibly popular and successful mistake, in that case. Is the purpose of Unicode to be an uncluttered spec, or to be a useful way to represent text?
Emojis aren’t text, though. Nobody “writes” emoji outside of the digital world, so I don’t think emoji could accurately be called “text” or part of a “writing system”:
> Unicode, formally The Unicode Standard, is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems
People write smiley faces, but there was already a code point for basic smileys.
I think standardized emoticons are a good thing, but I don’t think the Unicode spec is where they best belong.
Emoticons are a "writing system" (whether or not you believe them to be more "text" or "meta-text") that needs encoding, and one that was encoded in standards that predated Unicode. (They were at first added for compatibility with CJK systems that had already been using them for some time. They've been greatly expanded since those original encodings that Unicode "inherited" from other text encodings, but they started as text encodings in other standards.)
Personally, I think Unicode actually did the world a favor by taking ownership of emoji and standardizing them. There's nothing "weird" that emoji do that doesn't exist somewhere else in some other language's text encodings. Things like ZWJ sequences for complex graphemes existed in a number of languages that Unicode encoded well before emoji.
Emoji being common and popular helps break certain Western-centric assumptions that language encodings are "simple" and 1-to-1 code point to grapheme, assumptions that had needed breaking for many years before emoji gave Unicode users a test case that even "Western developers" had to respect if they wanted happy users. Better emoji support is better Unicode support across the lovely and incredible diversity of what Unicode encodes.
Aside: Also, given the current, common vernacular usages of the eggplant and peach emoji alone, I find it difficult in 2023 to argue that emoji aren't text and aren't being used as a language inside text for and of text.
Given that emoji are most typically used in text, having it all together seems better than two bodies coming up with incompatible solutions to emoji and the glyph issues of various languages.
In the end emoji is a language, a visual icon language.
What’s the difference between kanji and emoji? Tradition? Most text is now created digitally, so no wonder you have reflexivity here. You might think that it’s tail wagging the dog, but in truth the dog and the tail switched places. People want to write emoji, so it becomes Unicode.
And it depends on the location. For example, the characters U+1F1FA U+1F1F8 are the regional indicators "U" and "S": two separate codepoints that may together be displayed as a United States flag. Similarly, the regional indicators "T" and "W", and "H" and "K", are rendered on my system as the flags of Taiwan and Hong Kong, respectively. Depending on where you live, these regional indicators might not be rendered as flags.
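For illustration, a sketch building those sequences directly from the code points (whether a pair renders as a flag is entirely up to the font/platform):

    const US = String.fromCodePoint(0x1F1FA, 0x1F1F8); // regional indicators U + S
    const TW = String.fromCodePoint(0x1F1F9, 0x1F1FC); // regional indicators T + W
    const HK = String.fromCodePoint(0x1F1ED, 0x1F1F0); // regional indicators H + K
    [...US].length; // 2 code points, flag or no flag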
> Depending on where you live, these regional indicators might not be rendered as flags.
This is also why Microsoft-vended emoji fonts don't include flags/"regional indicators" support at all. On a Windows machine you just see the letters US or TW in an ugly boxed type and nothing like a flag.
The interesting background story goes: everyone remembers that Windows 95 had that fancy graphical Time Zone selector that looked real pretty in screenshots (but wasn't actually the best way to select time zones anyway). The reasons Microsoft removed that tool are less well known and almost entirely geopolitical: time zone borders often correspond to country borders and several countries got mad at their border looking wrong and sued Microsoft for "lying" about their borders. After a bunch of micro-changes to the graphics of that widget and then a lot of money spent on lawsuits and geopolitical fights, Microsoft eventually just entirely removed the widget because it was a "nice-to-have" and never actually necessary. It is said that from that fight Microsoft decided that it never wanted to be involved in that sort of geopolitics ever again, and flags are nothing if not geopolitical symbols, so Microsoft's emoji fonts hardly encode any flags at all. (In exchange they encode a lot of fun "cats with jobs" that other font vendors still don't. I find that a fun trade-off myself.)
The user can submit data of any length… browsers aren’t required to implement validation for you. Nor is your user required to use a browser or client you wrote. This is security and application development 102, welcome to my class.
It isn't just for emoji, either. More ordinary grapheme clusters like characters with combining accents are also perceived by users as a single character, and should be counted as one by software.
This breaks down in the terminal where you need to know which codepoints are double width for managing layout. Some are ambiguous width because they've been present since before the inclusion of emoji and had optional emoji presentation added to them. The final determination is font dependent so you can never be sure without full insight into the rendering chain.
This doesn't break down anything. If a grapheme cluster is double width and you treat it as 5x because there's 5 code points in it, then you've still gotten the layout wrong. You can enforce text storage length limitations to prevent infinite zalgo, but text display length limitations should always deal in extended grapheme clusters.
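As a sketch of that last point, a display-length limit can walk the string by grapheme clusters so it never cuts one in half (truncateToGraphemes and maxGraphemes are just illustrative names; this says nothing about double-width rendering, which still needs terminal/font knowledge):

    function truncateToGraphemes(str, maxGraphemes) {
      const seg = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
      let out = '';
      let count = 0;
      for (const { segment } of seg.segment(str)) {
        if (count === maxGraphemes) break; // never split a cluster
        out += segment;
        count += 1;
      }
      return out;
    }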
No, they don't even agree between engine implementations within the same database server. Generally limits in the database are defined as storage limits for however a character is defined. That usually means bytes or codepoints.
Thankfully MySQL also offers a non-gimped version of UTF-8 that one should always use in preference to the 3-byte version, but yeah it sucks that it's not the "obvious" version of UTF-8.
Is this part of MySQL's policy of "do the thing I've always done, no matter how daft or broken that may be, unless I see an obscure setting telling me to do the new correct thing" ?
That'd be my guess, but I don't really know. They just left the "utf8" type as broken 3-byte gibbled UTF-8, and added the "utf8mb4" type and "utf8mb4_unicode_ci" collation for "no, actually, I want UTF-8 for real".
It won't. We settled on using stateful combining characters instead.
(Remember when the selling point of switching the world to Unicode was "represent all writing systems with a single stateless 16 bit encoding"? Yeah, well, lol.)
No, the default these days is the saner utf8mb4, if you create a new database on a modern MySQL version. But if you have an old database using the old encoding, then upgrading databases doesn't magically update the encoding, because some people take backwards compatibility seriously.
It seems odd to suggest the bug is with Safari. Normal humans (and even most developers!) don’t care that the byte length of an emoji in a particular encoding that may or may not be under their control defines the maximum “characters” in a text box (“character” here meaning a logical collection of code points, each of which may occupy one or more bytes).
It's a bug with Safari because the HTML spec defines maxlength as applying to the number of UTF-16 code units [1]:
> Constraint validation: If an element has a maximum allowed value length, its dirty value flag is true, its value was last changed by a user edit (as opposed to a change made by a script), and the length of the element's API value is greater than the element's maximum allowed value length, then the element is suffering from being too long.
Where "the length of" is a link to [2]:
> A string’s length is the number of code units it contains.
And "code units" is a link to [3]:
> A string is a sequence of unsigned 16-bit integers, also known as code units.
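Which is the same thing JS's String#length reports. For the family emoji under discussion (built from escapes here, as a sketch):

    // U+1F468 ZWJ U+1F469 ZWJ U+1F467 ZWJ U+1F466: man, woman, girl, boy
    const family = '\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}';
    family.length; // 11 UTF-16 code units, the number maxlength is compared against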
I agree with your implied point that this is a questionable definition, though!
I am not sure it’s a bug. [1] also says (emphasis added)
“User agents _may_ prevent the user from causing the element's API value to be set to a value whose length is greater than the element's maximum allowed value length.”
and we have “MAY This word, or the adjective "OPTIONAL", mean that an item is truly optional” [2]
I do not rule out that, if you look at the source code or git history of their code, there are comments or test cases explicitly stating they deviate from enforcing a fixed maximum length in code points if doing so breaks up a grapheme cluster.
Also: what do other browsers do when you paste, for example, an ‘é’ that is made up of two code points ‘e’ and a combining character into a field that accepts only a single further code point? Do they just add an ‘e’?
Most é (like the one in your comment) are just encoded as U+00E9, but manually creating it as a combination and testing in Chrome, it does just convert back to being a plain e on paste.
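For reference, a quick sketch of the two forms being discussed:

    const precomposed = '\u00E9'; // é as a single code point
    const combining = 'e\u0301';  // 'e' followed by COMBINING ACUTE ACCENT
    precomposed === combining;                  // false
    combining.normalize('NFC') === precomposed; // true
    combining.length; // 2 code units, so a nearly-full maxlength field can split it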
IMHO that is then a "bug" in the HTML spec, which should be "fixed" to speak of extended grapheme clusters instead, which is what users would probably expect.
Except "maxlength" is often going to be added to fields to deal with storage/database limitations somewhere on the server side and those are almost always going to be in terms of code points, so counting extended grapheme clusters makes it harder for server-side storage maximums to agree with the input.
Choosing 16-bit code units as the base keeps consistency with the naive string-length counting of JS in particular, which has always returned length in UTF-16 code units.
Sure, it's slightly to the detriment of user experience in terms of how a user expects graphemes to be counted, but it avoids later storage problems.
UTF-16 seems very likely to me in database fields that cleanly support Unicode. It's the native default (for decades) in SQL Server and Oracle. Up until very recently it was the only Unicode safe character format for MySQL (when a true 8-bit UTF-8 character type was finally added after decades of hacks with broken 7-bit encodings). Postgres is the only SQL database I'm aware of that has had clean UTF-8 support for as long of a time that UTF-16 has been recommended or defaulted in every other SQL database.
Obviously there are a lot of databases out there with hacks using things like bad UTF-8 in unsafe column formats not designed for it, of course. But if you are working with a database like that, codepoint counting is possibly among your least worries.
Yeah, the author's conclusion is flawed. If I enter an emoji and a different one appears, I'm just going to assume your website is broken. Safari is in the right here.
It's not that clear-cut. If you enter that 11-code-unit emoji into a field that has maxLength=10, and the browser lets you do so, but the backend system that receives the data only stores 10 code units, you're worse off -- because you probably won't realise your data got corrupted -- than if the browser had prevented you entering/submitting it.
Any attempt at defining what people think of as characters is going to fail because of how many exceptions our combined writing systems have. See: codepoints, characters, grapheme clusters.
Edit 2: Actually it took 7 backspaces to obliterate the whole family. Tough one.
Edit: Oops. I was trying to type emoji into HN comment. apparently it is not supported.
I never knew that [emoji of family of 4] takes 5 backspace to eliminate.
It goes from [emoji of family of 4] to [emoji of family of 3] to [father and mother] to [father] to [father]. Somehow [father] can take double the (key)punches of the rest of his family.
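For the curious, a sketch of the code point structure behind that (the seven code points line up with the seven backspaces noted in the edit above, though exact behavior depends on how the editor chooses to delete):

    const family = '\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}';
    for (const ch of family) {
      // prints U+1F468, U+200D, U+1F469, U+200D, U+1F467, U+200D, U+1F466
      console.log('U+' + ch.codePointAt(0).toString(16).toUpperCase());
    }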
Edge cases like this will just get more common as Unicode keeps getting more complex. There was a fun slide in this talk[1] that suggests Unicode might be Turing complete due to its case folding rules.
I miss when Unicode was just a simple list of codepoints. (Get off my lawn)
These "edge cases" have always existed in Unicode. Languages with ZWJ needs have existed in Unicode since the beginning. That emoji put a spotlight on this for especially English-speaking developers with assumptions that language encodings are "simple", is probably one of the best things about the popularity of emoji.
I think we need some kind of standard "Unicode-light" with limitations that allow it to be used on low-spec hardware and without weird edge cases like this. A bit like video codecs that have "profiles", which are limitations you can adhere to in order to avoid overwhelming low-end hardware.
It wouldn't be "universal", but enough to write in the most commonly used languages, and maybe support a few single-codepoint special characters and emoji.
The _entire point_ of Unicode is to be _universal_, you're just suggesting we go back to pre-Unicode days and use different code-pages.
This, for reasons stated in already posted responses, and to put it mildly, does not work in international context (e.g. the vast majority of software written every day).
You can't express the most commonly used languages without multiple code point graphemes.
If you want to eliminate edge cases you would need to introduce new incompatible code points and a 32 or 64 bit fixed length encoding depending on how many languages you want to support.
Extra pages for these extra code points wouldn't seem far-fetched to me. We already have single-code-point letters with diacritics like "é", or a huge code page of Hangul, in which each code point is a combination of characters.
As for encoding, 32 bits fixed length should be sufficient; I can't believe that we would need billions of symbols, combinations included, in order to write in most common languages, though I may be wrong.
Also, "limiting" doesn't necessarily mean "single code point only", but more like "only one diacritic, only from this list, and only for these characters", so that the combination fits within a certain size limit (e.g. a 32-bit word) and the engine only has to process a limited number of cases.
The subset of codepoints included in NFKD would probably make a decent starting point for such a standard; if you want to be even more restrictive, then limit it to the BMP.
Is it just that whole block that's allowed? I feel like some of the legacy characters that have been kind of "promoted" to emoji I've seen allowed here. Let's see, is the all-important 🆗 allowed?
The point of having a standard is to know which ones to deny.
Make a study over a wide range of documents about the characters that are used the most, see if there are alternatives to the characters that don't make it, etc...
This is unlike the HN ban on emoji, which I think is more of a political decision than a technical one. Most people on HN use systems that can read emoji just fine, but they decided that the site would be better without it. This would be more technical, a way to balance inclusivity and technical constraints, something between ASCII and full Unicode.
I’ve worked on large data entry forms for decades. I stopped using maxlength a long time ago because of this. Entry should be free-form; truncation is unexpected behavior. Validation should catch the limitation and never manipulate what the user entered. People paste without realizing text got cut off, and information gets lost. They type without looking at the screen. I’ve even seen it used on a password-setting form but not on the sign-in form, so anyone who thinks they set a password longer than the limit believes it succeeded, but login fails and even tech support gets baffled for hours.
We already know client-side limits aren’t enough and server validation is required anyway. Trying to cleverly “help” entry usually just causes headaches for users and devs. Dynamically showing/hiding, enabling/disabling, focus jumping, auto-formatting, and updating other fields based on other values are usually confusing for the user and more difficult to program correctly. Just show it all, allow all entry, and catch everything in the validation code; everyone will be happier.
Odd. It makes sense on a technical level when it comes to these ZWJ characters, but hiding the implementation makes sense from Safari’s point of view. I’d actually prefer that as a UNIVERSAL standard: visible symbols vs. characters (when it comes to UI).
But I can also imagine this is problematic when setting validation rules elsewhere, and now there’s a subtle footgun buried in most web forms.
I guess the thing to learn here is to not rely on maxlength=
I hope no one is relying on it! But in the context of a regular user, maxlength measuring all chars is failsafe (user is allowed to submit a long string that should be caught by validation elsewhere) vs measuring visible chars is faildangerous (user can’t submit a valid string because the front end validation stops them)
Seems like an obvious root cause for these sorts of things are languages (and developers) who can't or won't differentiate between "byte length" and "string length". Joel warned us all about this 20 years (!!) ago[1], and we're still struggling with it.
Good point. When it comes to text there are a lot of "lengths". Languages should differentiate between each of those lengths, and programmers should pay attention to the difference and actually use the right one for whatever situation they're programming.
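A sketch of the usual suspects side by side, using the family emoji from the article:

    const family = '\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}';
    new TextEncoder().encode(family).byteLength; // 25 UTF-8 bytes
    family.length;                               // 11 UTF-16 code units
    [...family].length;                          //  7 code points
    [...new Intl.Segmenter(undefined, { granularity: 'grapheme' })
        .segment(family)].length;                //  1 grapheme cluster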