No, it's definitely not. You can read the byte length more directly in JS, and use that to inform if more text is allowed or not.
const encoder = new TextEncoder();
const currentBytes = encoder.encode(inputStr).byteLength;
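If the goal is a hard storage limit, here's a minimal sketch building on that (MAX_BYTES and canAcceptMore are made-up names for illustration, not from any spec):

// Reusing the encoder above: refuse further input once the
// UTF-8 size of the current value reaches a byte budget.
const MAX_BYTES = 255; // illustrative limit
function canAcceptMore(str) {
  return encoder.encode(str).byteLength < MAX_BYTES;
}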
HTML is for UIs, and I don't think many users would expect an emoji to be counted as 11 symbols when it looks like just one. If this weren't an emoji but a symbol from a more complex writing system, splitting it up would clearly be unacceptable, so I don't see why one should make an exception here just because this one known example of a family emoji still mostly survives the truncation.
I don't actually know whether ninja cat would count as one or two though. The spec for calculating grapheme boundaries is actually several pages long.
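For what it's worth, modern JS engines expose the default algorithm via Intl.Segmenter, so you can ask them directly. A quick illustration using the standard family ZWJ sequence (since ninja cat itself is unlikely to render outside Microsoft fonts):

// Count default grapheme clusters per UAX #29, as implemented
// by whatever Unicode version the engine ships.
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}'; // man ZWJ woman ZWJ girl ZWJ boy
console.log([...family].length);              // 7 code points
console.log([...seg.segment(family)].length); // 1 grapheme cluster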
And the OS might not obey the Unicode standard. Ninja cat appears to be proprietary to Microsoft.
Even the overview includes this note:
> Note: Font-based information may be required to determine the appropriate unit to use for UI purposes
First of all, grapheme counting depends on normalization and there's like 6 Unicode normalization algorithms to consider, depending on locale and intended use/display style.
Keep in mind that grapheme counting includes things like natural ligatures, which English has always had a rough time counting. Things like `fi` and `ft` sometimes form single visual glyphs from the font's perspective in some serif fonts, but are always supposed to be counted as two graphemes from the user's perspective.
Relatedly, one of the simpler parts of the Unicode standard algorithm is that a ZWJ always signals a grapheme cluster/merged grapheme, so per the standard algorithm the Ninja Cat is always a single grapheme. The usual emoji fallback rules mean that non-Microsoft fonts, which don't include the Cat+ZWJ+Job sequences ("cats with jobs emojis"), display two characters to try to get the idea across, but the spec says they should still count that as only a single grapheme even when not presented as such.
(ETA Aside: I think the cats with jobs emojis are great and fun and should be more widely adopted outside of just Microsoft. Why should only the people emojis have jobs?)
It says "The boundary specifications are stated in terms of text normalized according to Normalization Form NFD (see Unicode Standard Annex #15, “Unicode Normalization Forms” [UAX15]). In practice, normalization of the input is not required."
And there's also "Even in Normalization Form NFC, a syllable block may contain a precomposed Hangul syllable in the middle."
So are you sure normalization matters? I'm excluding K normalizations since they're not for general use and alter a bunch of characters on purpose. But also, that's only NFC/NFD/NFKC/NFKD; are there two others, or was "six" just a misremembering?
> I'm excluding K normalizations since they're not for general use
"Not for general use" doesn't mean that they aren't somebody's use and still a complicating factor in everything else.
NFC/NFD will give you the same graphemes.
Or to put it another way, there's only two real normalizations, and there's some weird junk off to the side.
And nope, it doesn't depend at all on font, user agent, or locale. Nor does it depend on normalization (that's explicitly stated). And like you say, a ligature for 'fi' is two graphemes. All of this is for the purposes of text selection, counting characters, etc.
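That's easy to check in a JS console, assuming an engine with Intl.Segmenter (the strings are just illustrative):

// NFC and NFD differ in code units/points but not in grapheme boundaries.
const nfc = 'e\u0301'.normalize('NFC'); // U+00E9, precomposed
const nfd = 'e\u0301'.normalize('NFD'); // U+0065 + U+0301 combining acute
const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
console.log(nfc.length, nfd.length);                                      // 1 2
console.log([...seg.segment(nfc)].length, [...seg.segment(nfd)].length);  // 1 1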
You're right there are additional considerations in rendering/UX, particularly with font ligatures. But that operates at a different level from graphemes.
(The spec also explains that more sophisticated grapheme definitions can be constructed for locales. But "The Unicode definitions of grapheme clusters are defaults: not meant to exclude the use of more sophisticated definitions of tailored grapheme clusters where appropriate." So it works as a single shared canonical reference.)
I included this note from that very report:
> Note: Font-based information may be required to determine the appropriate unit to use for UI purposes, such as identification of boundaries for first-letter paragraph styling.
You fixated on this additional pull quote:
> The Unicode definitions of grapheme clusters are defaults: not meant to exclude the use of more sophisticated definitions of tailored grapheme clusters where appropriate. Such definitions may more precisely match the user expectations within individual languages for given processes.
I would put additional emphasis on the word default there, which the document elsewhere contrasts with tailored: the preferred terminology for distinguishing the "simple algorithmic rules of thumb" that this particular algorithm provides from user-focused/locale-dependent grapheme considerations (which still vary a lot).
For instance, this bit on tailored graphemes:
> Grapheme clusters can be tailored to meet further requirements. Such tailoring is permitted, but the possible rules are outside of the scope of this document. One example of such a tailoring would be for the aksaras, or orthographic syllables, used in many Indic scripts.
(And a later table of more examples of tailored graphemes that the "default grapheme" algorithm will not catch the subtleties of.)
As for relying on normalization, I may have misread this particular pull quote:
> A key feature of default Unicode grapheme clusters (both legacy and extended) is that they remain unchanged across all canonically equivalent forms of the underlying text. Thus the boundaries remain unchanged whether the text is in NFC or NFD.
I read that as implying that the algorithm's input needed to be in NFC or NFD (though there is clearly no difference to the algorithm which of the two forms it is in), not that the algorithm should work just as well on unnormalized input. Rereading it again, I'm still not sure whether the algorithm is well-defined on unnormalized input, but I can see how it reads as a possible implication that it might work. (Maybe?)
I understand that technical limitations will sometimes cause a combined character to be displayed as more than one character, as happens in Safari with that ninja cat, and in those cases it's preferable to decompose the emoji, or use a tofu character, rather than swap it for a different one.
Being much more frequently used than other, more exotic characters, they just reveal bugs more often.
> Unicode, formally The Unicode Standard,[note 1][note 2] is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems
People write smiley faces, but there was already a code point for basic smileys.
I think standardized emoticons are a good thing, but I don’t think the Unicode spec is where they best belong.
Emoticons are a "writing system" (whether or not you believe them to be more "text" or "meta-text") that needs encoding, and one that was encoded in standards that predated Unicode. (They were at first added for compatibility with CJK systems that had already been using them for some time. They've been greatly expanded since those original encodings that Unicode "inherited" from other text encodings, but they started as text encodings in other standards.)
Personally, I think Unicode actually did the world a favor by taking ownership of emoji and standardizing them. There's nothing "weird" that emoji do that doesn't exist somewhere else in some other language's text encodings. Things like ZWJ sequences for complex graphemes existed in a number of languages that Unicode encoded well before emoji. Emoji being common and popular helps break certain Western-centric assumptions that language encodings are "simple" and 1-to-1 code point to grapheme, assumptions that had needed breaking for many years before emoji gave Unicode users a test case that even "Western developers" needed to respect if they wanted happy users. Better emoji support is better Unicode support across the lovely and incredible diversity of what Unicode encodes.
In the end emoji is a language, a visual icon language.
Edit: Looks like HN stripped out the character codes. The effect can be reproduced by copy-pasting the symbols from https://en.wikipedia.org/wiki/Regional_indicator_symbol
This is also why Microsoft-vended emoji fonts don't include flags/"regional indicators" support at all. On a Windows machine you just see the letters US or TW in an ugly boxed type and nothing like a flag.
The interesting background story goes: everyone remembers that Windows 95 had that fancy graphical Time Zone selector that looked real pretty in screenshots (but wasn't actually the best way to select time zones anyway). The reasons Microsoft removed that tool are less well known and almost entirely geopolitical: time zone borders often correspond to country borders, and several countries got mad at their border looking wrong and sued Microsoft for "lying" about their borders. After a bunch of micro-changes to the graphics of that widget and then a lot of money spent on lawsuits and geopolitical fights, Microsoft eventually just removed the widget entirely because it was a "nice-to-have" and never actually necessary. It is said that from that fight Microsoft decided it never wanted to be involved in that sort of geopolitics ever again, and flags are nothing if not geopolitical symbols, so Microsoft's emoji fonts hardly encode any flags at all. (In exchange they encode a lot of fun "cats with jobs" that other font vendors still don't. I find that a fun trade-off myself.)
A string length is now the number of grapheme clusters, or just the width.
Do database engines all agree on edge cases like this?
> Constraint validation: If an element has a maximum allowed value length, its dirty value flag is true, its value was last changed by a user edit (as opposed to a change made by a script), and the length of the element's API value is greater than the element's maximum allowed value length, then the element is suffering from being too long.
Where "the length of" is a link to :
> A string’s length is the number of code units it contains.
And "code units" is a link to :
> A string is a sequence of unsigned 16-bit integers, also known as code units.
I agree with your implied point that this is a questionable definition, though!
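Concretely, that definition makes "length" the number of UTF-16 code units, which is what maxlength compares against and which also happens to be what JS's .length reports. A quick illustration (the family sequence is the same kind of ZWJ emoji discussed upthread):

// "Length" per the Infra definition above: UTF-16 code units.
console.log('abc'.length);       // 3
console.log('\u{1F600}'.length); // 2: an emoji outside the BMP is a surrogate pair
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}';
console.log(family.length);      // 11, so maxlength="10" can cut into the middle of it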
“User agents _may_ prevent the user from causing the element's API value to be set to a value whose length is greater than the element's maximum allowed value length.”
and we have “MAY This word, or the adjective "OPTIONAL", mean that an item is truly optional” 
I do not rule out that, if you look at the source code or git history of their code, there are comments or test cases explicitly stating they deviate from enforcing a fixed maximum length in code points if doing so breaks up a grapheme cluster.
Also: what do other browsers do when you paste, for example, an ‘é’ that is made up of two code points ‘e’ and a combining character into a field that accepts only a single further code point? Do they just add an ‘e’?
Choosing 16-bit code units as the base keeps consistency with the naive string length counting of JS in particular, which has always returned length in 16-bit code units.
Sure, it's slightly to the detriment of user experience in terms of how a user expects graphemes to be counted, but it avoids later storage problems.
Obviously there are a lot of databases out there with hacks using things like bad UTF-8 in unsafe column formats not designed for it, of course. But if you are working with a database like that, codepoint counting is possibly among your least worries.
The length in code units is unambiguous and constant across versions.
A good starting place is UAX #29: https://www.unicode.org/reports/tr29/tr29-41.html
However, the gold standard in UI implementations is that you never break the user's input.
It starts by breaking down common Unicode assumptions folks have.
People don't think about code points and they definitely don't think about code units.
What does "character" mean?
Grapheme clusters aren't perfect but they're far ahead of code whatevers.
Edit: Oops. I was trying to type an emoji into an HN comment; apparently it is not supported.
I never knew that [emoji of family of 4] takes 5 backspaces to eliminate.
It goes from [emoji of family of 4] to [emoji of family of 3] to [father and mother] to [father] to [father]. Somehow [father] can take double the (key)punches of the rest of his family.
I miss when Unicode was just a simple list of codepoints. (Get off my lawn)
It wouldn't be "universal", but enough to write the most commonly used languages, and maybe support a few single-codepoint special characters and emoji.
This, for reasons stated in already posted responses, and to put it mildly, does not work in international context (e.g. the vast majority of software written every day).
If you want to eliminate edge cases you would need to introduce new, incompatible code points and a 32- or 64-bit fixed-length encoding, depending on how many languages you want to support.
As for encoding, 32 bits fixed length should be sufficient; I can't believe that we would need billions of symbols, combinations included, in order to write the most common languages, though I may be wrong.
Also, "limiting" doesn't not necessarily means "single code point only", but more like "only one diacritic, only from this list, and only for these characters", so that combination fits in a certain size limit (ex: a 32 bit word), and that the engine only has to process a limited number of use cases.
Plenty of software already does that-- HN itself doesn't allow emoji, for example.
Study a wide range of documents to find which characters are used the most, see if there are alternatives for the characters that don't make the cut, etc.
This is unlike the HN ban on emoji, which I think is more of a political decision than a technical one. Most people on HN use systems that can read emoji just fine, but they decided that the site would be better without it. This would be more technical, a way to balance inclusivity and technical constraints, something between ASCII and full Unicode.
We already know client-side limits aren’t enough and server validation is required anyway. Trying to cleverly “help” entry usually just causes headaches for users and devs. Dynamically showing/hiding, enabling/disabling, jumping focus, auto-formatting, and updating fields based on other values are usually confusing for the user and more difficult to program correctly. Just show it all, allow all entry, and catch everything in the validation code; everyone will be happier.
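As a sketch of what that catch-all server-side validation might look like (the limits, names, and the choice of counting both bytes and graphemes are illustrative, assuming a runtime with TextEncoder and Intl.Segmenter):

// Illustrative validation: one limit for storage (bytes),
// one for what a user perceives as length (grapheme clusters).
function validateComment(text) {
  const bytes = new TextEncoder().encode(text).byteLength;
  if (bytes > 10000) return { ok: false, reason: 'too long to store' };

  const seg = new Intl.Segmenter('en', { granularity: 'grapheme' });
  const graphemes = [...seg.segment(text)].length;
  if (graphemes > 500) return { ok: false, reason: 'too long to display' };

  return { ok: true };
}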
But I can also imagine this is problematic when setting validation rules elsewhere, and now there’s a subtle footgun buried in most web forms.
I guess the thing to learn here is to not rely on maxlength=
Which is weird
Here is a video of another failure case:
But the maxlength attribute is at best an approximation. Don't rely on it for things like limiting length for database fields (not that you should trust the client anyway).