What every software developer must know about Unicode in 2023 (tonsky.me)
989 points by mrzool 11 months ago | 555 comments





There's one part of this document that I would push extremely hard against, and that's the notion that "extended grapheme clusters" are the one true, right way to think of characters in Unicode, and therefore any language that views the length in any other way is doing it wrong.

The truth of the matter is that there are several different definitions of "character", depending on what you want to use it for. An extended grapheme cluster is largely defined on "this visually displays as a single unit", which isn't necessarily correct for things like "display size in a monospace font" or "thing that gets deleted when you hit backspace." Like so many other things in Unicode, the correct answer is use-case dependent.

(And for this reason, String iteration should be based on codepoints--it's the fundamental level on which Unicode works, and whatever algorithm you want to use to derive the correct answer for your purpose will be based on codepoint iteration. hsivonen's article (https://hsivonen.fi/string-length/), linked in this one, does try to explain why extended grapheme clusters is the wrong primitive to use in a language.)
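
For illustration, here is a minimal sketch of what code point iteration looks like in a language that already works this way (Python's `str` iterates by code point). The `describe` helper is a hypothetical name, not something from the article.

```python
import unicodedata

def describe(text: str) -> list[tuple[str, str]]:
    """Return a (U+XXXX, character name) pair for each code point in text."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for ch in text
    ]

# "e" followed by a combining acute accent: one visible unit, two code points.
for code, name in describe("e\u0301"):
    print(code, name)
```

Any higher-level notion (grapheme clusters, display columns, backspace units) would be layered on top of a loop like this one.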


Agreed. And one more consideration is that (extended) grapheme cluster boundaries vary from one version of Unicode to another, and also allow for "tailoring." For example, should "อำ" be one grapheme cluster or two? It's two on Android, but one per the Unicode recommendation, which is also the behavior on Mac. So in applications where a query such as length needs to have one definitive answer which cannot change by context, counting (extended) grapheme clusters is the wrong way to go.


In fact, the name "extended" grapheme cluster should give it away. There was already a major revision to UAX #29 so that the original version is now referred to as "legacy". Your example is exactly this case: the second character, U+0E33 THAI CHARACTER SARA AM, prohibits a cluster boundary now but previously didn't [1].

[1] Relevant specifications: https://unicode.org/reports/tr29/#SpacingMark and https://unicode.org/reports/tr29/#GB9a
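
A small sketch of the stable part of this example, assuming only Python's standard library: the code point count and the character names do not depend on segmentation rules, while the cluster count does. The standard library has no grapheme segmenter, so this only prints the code points and the version of the Unicode data the interpreter ships with.

```python
import unicodedata

# The Thai example from above, written with escapes:
# THAI CHARACTER O ANG followed by THAI CHARACTER SARA AM.
text = "\u0E2D\u0E33"

print(len(text))  # number of code points, independent of cluster rules
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Version of the Unicode Character Database bundled with this interpreter.
# Segmentation libraries built on different versions may disagree.
print(unicodedata.unidata_version)
```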


> An extended grapheme cluster is largely defined on "this visually displays as a single unit", which isn't necessarily correct for things like "display size in a monospace font" or "thing that gets deleted when you hit backspace."

I'm sorry, but I fail to see how "This visually displays as a single unit" could ever differ from "Display size in a monospace font" or "Thing that gets deleted when you hit backspace".


A couple of cases I'm aware of...

* Coding ligatures often display as a single glyph (maybe occupying a single-width character space, or maybe spread out over multiple spaces), but are composed of multiple glyphs. The ligature may "look" like a single character for purposes of selection and cursoring, but it can act like multiple characters when subject to backspacing.

* Similarly, I've seen keyboard interfaces for various languages (e.g., Hindi) where standard grapheme cluster rules bind together a group of code points, but the grapheme cluster was composed from multiple key presses (which typically add one code point each to the cluster). And in some such interfaces I've seen, the cluster can be decomposed by an equal number of backspace presses. I don't have a good sense of how much a monospaced Hindi font makes sense, but it's definitely a case where a "character" doesn't always act "character-like".


I've always felt ligatures that condense two or more glyphs into something that takes up the space of only one in a monospace font are going beyond what a font should handle and into the realm of what an editor should do. I have several such visual substitutions set up in my .emacs but I don't use fonts that do them on their own.


What about ligatures that make ASCII characters display differently when in proximity, but still use the same number of columns?

For example, when == is written, connect them to be a 2 column wide = instead.

Or when === is written, display a three column equals sign, but it's three bars instead of two.


> Display size in a monospace font

Some clusters are going to be multiple characters wide.

> thing that gets deleted when you hit backspace

Some clusters are meant to be composed of multiple keystrokes, and a natural editing experience would allow users to delete the last stroke.

Look into how Korean works.


See, e.g., https://github.com/xi-editor/xi-editor/issues/655 for why backspace isn't the same as extended grapheme cluster.

As for "display size in monospace font", emojis and CJK characters are usually two units wide, not one (although, to be honest, there's a fair amount of bugs in the Unicode properties that define this).
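
As a rough sketch of the monospace-width point, assuming the East_Asian_Width property is a good enough approximation (real terminals use more elaborate tables): `terminal_width` is a hypothetical helper that counts wide and fullwidth code points as two columns and skips combining marks.

```python
import unicodedata

def terminal_width(text: str) -> int:
    """Approximate the number of terminal columns text occupies."""
    width = 0
    for ch in text:
        if unicodedata.combining(ch):
            # Combining marks are drawn over the previous character.
            continue
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # wide or fullwidth: two columns
        else:
            width += 1
    return width

print(terminal_width("A\u7F8EC"))  # ASCII, one CJK ideograph, ASCII
```

This ignores zero-width joiners, variation selectors and emoji sequences, which is part of why width tables in practice are messier than the property alone.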


Here’s a good example of the test cases used for backspaces in Android[1]. It’s definitely more involved than just deleting a grapheme cluster.

[1] https://android.googlesource.com/platform/frameworks/base/+/...



If you type "a", combine it with "´", then change your mind and hit backspace, you probably want to end up with "a" even though "á" was a thing "visually displayed as a single unit".


As a European, no I don't. á isn't used in my language, but my layout offers it via a dead-key-then-base-letter mechanism, and it is correctly treated as one unit when pressing backspace, anything else would feel incorrect. It would be even worse if such a thing happened for the letters that my layout offers individual buttons for (ÅÄÖ). Some languages do treat these as letters with attached modifiers, but many, including mine, treat them as indivisible letters that just happen to look similar to some others for historical reasons, and to treat them as combinations of base letters and diacritics would be completely incorrect, even if you typed them in using the dead-key-then-base-letter mechanism for some reason.


Most European keyboard layouts have it the other way around: first press a "dead key" for the diacritic mark and then the letter to apply it to.

Where some layouts may require this method for some characters, another keyboard layout may have the same character on a dedicated key.

The program receives the combined character as one unit, and does not need to be aware of different keyboard layouts.
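
A minimal sketch of that dead-key flow, assuming a made-up table (`DEAD_KEYS`) and helper (`compose_dead_key`); real layouts do this inside the OS input stack. The accent is typed first, but Unicode puts the combining mark after the base letter, and NFC then yields the single precomposed character the program receives.

```python
import unicodedata

# Hypothetical dead-key table: spacing accent as typed -> combining mark.
DEAD_KEYS = {
    "\u00B4": "\u0301",  # acute accent
    "`": "\u0300",       # grave accent
    "^": "\u0302",       # circumflex
    "\u00A8": "\u0308",  # diaeresis
}

def compose_dead_key(dead_key: str, base: str) -> str:
    """Combine a dead key press with the following base letter."""
    combining = DEAD_KEYS[dead_key]
    # The mark goes after the base letter; NFC merges the pair when a
    # precomposed character exists.
    return unicodedata.normalize("NFC", base + combining)

print(compose_dead_key("\u00A8", "e"))
```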


> first press a "dead key" for the diacritic mark and then the letter to apply it to.

That being exactly the way “floating diacritics” in ISO 2022 (or properly one of its Latin encodings, T.51 = ISO 6937) work, amusingly. I wonder which came first. (Yes, I know that a<BS>` came first, the ASCII spec even says that this should give you an accented character IIRC. Or perhaps it was one of the other “don’t call it ASCII” specs—ISO 646? IA5?..)


> Most European keyboard layouts have it the other way around: first press a "dead key" for the diacritic mark and then the letter to apply it to.

Which ones? At least the French and German ones don’t work like that: there is no composing, just separate keys for all the characters with diacritics that appear in the language.


Nitpicking, but most French keyboards have both ready-made keys for "é" and a few other commonly used characters, and composing by hitting either '¨' or '^'. For example, hitting '¨' then 'e' produces "ë".


You are right, thanks.


The AZERTY layout is nothing if not inconsistent.


For this specific example, it is actually quite pragmatic. "é" being used many orders of magnitude more often than "ë" in French, it makes sense for it to have its own key.


Also, French has no other character that takes an acute accent. For the same reason, ç isn't typed with a dead key on French AZERTY.


The nordic layout(s) offer such a mechanism to allow people to type in letters that you'll find in various other European languages, even though the extra letters used in the languages themselves (ÅÄÖÆØ) are present as their own keys. Interestingly, the Swedish layout has no dedicated é key, although é occurs in some Swedish words.


In Swedish, Å, Ä, and Ö are actual letters of the alphabet, while é is used in foreign words. Similarly, the English dieresis (e.g. in coöperate) is essentially unknown in the US and only occasionally used in England, so it doesn't give rise to characters with dieresis on the keyboard.


é is used commonly in names and some words that don't feel foreign. For example the word for idea is written idé. Seems like it's an old loan from greek.


The accent gives away that this is in fact a loan from French


I guess it's time to learn to use some real swedish words then and not the foreign ones. Bye idé, hello hugskott. (Hug/håg = mind, skott = shot)


Which French layout would that be? I've never seen a French keyboard where this is true. French is my native language. On layouts I'm familiar with, some accented letters have separate keys like é, but not all, the others are made by composing an accent key with a letter.


You're right, sorry. I had forgotten about the ^ and ¨ keys.


Press the key to the left of 1 (not the numpad) or the right of the Eszett (sharp S) on the German QWERTZ keyboard and you probably hit a dead key. There are dedicated keys for the German umlauts and Eszett but these are for French accents in loan words: â, á and à, e.g. as in Café.

It's worth mentioning of course that there are no-dead-keys variants of the keyboard layout but this has been pretty much the norm on Windows since the 1990s I think.


On the German Layout the backtick (next to the 1 key) is a dead key.


Which is the thing that finally pushed me over the edge to switch to the US layout. Backticks are something I use all the time.


Danish keyboards also require you to press '¨' first and then 'o' to produce 'ö'.


Do the Danes not have the mechanism that is found on Finnish keyboard layouts, where pressing AltGr+Ö yields Ø and AltGr+Ä yields Æ, except in reverse?


Those mappings are not universal. They are present under Linux but not on MS-Windows. I don't know about Mac, but the layout has in the past been slightly different there from Windows also.


Interesting, it's been like a decade since I last used windows, but I had to go and check what layouts are available, since I remember having these combinations on my layout. Apparently those combinations are provided by windows in the "Finnish and Sami" layout, which provides a number of extra letters (not just ones used by the Sami languages) through AltGr+letter combinations. I must have selected that as my layout at some point while I was still using windows, possibly for the purpose of getting easier access to letters like ÆØÕ, and just forgotten it after some time.


They are there on mac too, use option-Ä to get the Æ or the other way around. What's more, it has worked like that since system 7.x times or so, it's just a good idea.


For me that doesn't work on Windows. Those key combinations don't seem to do anything.


I had to change settings on Windows to get access to a mode. I could then enable that mode to be able to readily type Spanish correctly. The mode uses the key combinations as described.


But do you really use ö much over ø?


No, but I do once in a while (very rarely) write a little in german that might use that character.


Slovak or Czech for example.


Danish is one.


I expect to delete the character "á". And I prefer consistency too so I expect "œ" and "<emoji>" and "<emoji>" to be deleted as one unit.

edit: emojis are filtered by HN


Even the emoji's that you create by combining multiple emojis? Type one emoji, then a second, it merges into one. What happens when you backspace?


I expect the full emoji to be deleted. Because it merged into one visual entity.

I guess it's more ambiguous for some languages that can have long ligatures though.


I'd say delete the whole with backspace, but only the last if you undo.


But then if I type "á" directly (through, say, a mobile keyboard) and hit backspace, I'd get "a", which doesn't seem terrible but does feel a little off.

Seems like the right answer for codepoints vs graphemes, unfortunately, is dependent on the context.


In terminals there is a distinction between single-width and double-width characters (east-asian characters, in particular). E.g. the three characters

    A美C
would take up the width of four ASCII monospace characters, the “美” being double-width.

Similarly, for composed characters like say the ligature “ﬀ” (U+FB00), you may want to backspace as if it was two “f”s (which logically it is, and decomposes to in NFKD normalization).
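
The NFKD claim can be checked with the standard library. This is only a sketch of the normalization behavior, not of any editor's backspace logic.

```python
import unicodedata

ligature = "\uFB00"  # LATIN SMALL LIGATURE FF

print(unicodedata.name(ligature))
# Compatibility decomposition splits the ligature into two plain letters.
print(unicodedata.normalize("NFKD", ligature))
# Canonical decomposition leaves it alone: the mapping is compatibility-only.
print(unicodedata.normalize("NFD", ligature) == ligature)
```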


Unicode even has distinct full width and half width variants of Japanese katakana - where ‘full width’ is (in theory) as wide as two Latin characters.

   Latin:      Katakana
   Full width: カタカナ
   Half width: ｶﾀｶﾅ
(How that fixed width text looks in a web browser is anyone’s guess though. On iOS none of the Japanese kana stay on the fixed grid.)
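
For what it's worth, the two variants are distinct code points with different East_Asian_Width values, and compatibility normalization folds the half-width forms into the full-width ones. A small check, with the strings written as escapes so they survive copy and paste:

```python
import unicodedata

full = "\u30AB\u30BF\u30AB\u30CA"  # KATAKANA LETTER KA, TA, KA, NA
half = "\uFF76\uFF80\uFF76\uFF85"  # HALFWIDTH KATAKANA LETTER KA, TA, KA, NA

print([unicodedata.east_asian_width(ch) for ch in full])
print([unicodedata.east_asian_width(ch) for ch in half])

# NFKC maps the half-width forms onto the regular katakana.
print(unicodedata.normalize("NFKC", half) == full)
```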


The background here is that both are contained in the Japanese Shift JIS character set, and Unicode provides roundtrip compatibility. And they are in Shift JIS because the half-width katakana were in the 8-bit JIS character set [0] used with text-mode displays where all characters have the same width. To preserve screen layout, these later had to be distinguished from full-width katakana.

[0] https://en.wikipedia.org/wiki/JIS_X_0201


𒐫

𒈙


If I have a text with niqqudim I am going to want to think of the niqqudim differently when editing despite the fact they are entwined with the consonants.


Characters that alter their appearance to take up one or more display units depending on the characters next to them (before and after). That would be a very crude example, but these types of characters appear all the time in human language.


There are libraries that help with iterating both code-points and grapheme clusters... - but are there any of them that can help decide what to do for example when pressing backspace given an input string and a cursor position? Or any other text editing behavior. This use-case-dependent behavior must have some "correct" behavior that is standardized somewhere?

Like a way to query what should be treated like a single "symbol" when selecting text? Basically something that could help out users making simple text-editors. There are so many bad implementations out there that do it incorrectly, so there must be some tools/libraries to help with this? Not only for actual applications but for people making games as well where you want users to enter names, chat or other text. Not all platforms make it easy (or possible) to embed a fully fledged text editing engine for those use-cases.

I can imagine that typing a multi-code-point character manually by hand would allow the user to undo their typing mistake by a single backspace press when they are actively typing it, but after that if you return to the symbol and press backspace that it would delete the whole symbol (grapheme cluster).

For example if you manually entered the code points for the various family combination emojis (mother, son, daughter) you could still correct it for a while - but after the fact the editor would only see it as a single symbol to be deleted with a single backspace press?

Or typing 'o' + '¨' to produce 'ö' but realizing you wanted to type 'ô', there just one backspace press would revert it to 'o' again and you could press '^' to get the 'ô'. (Not sure that is the way in which you would normally type those characters but it seems possible to do it with unicode that way).


I'd argue that you must use grapheme clusters for text editing and cursor position, because there are popular characters (like ö you used as example) which can be either one or two codepoints depending on the normalization choice, but the difference is invisible to the user and should not matter to the user, so any editor should behave exactly the same for ö as U+00F6 (LATIN SMALL LETTER O WITH DIAERESIS) and ö as a sequence of U+006F (LATIN SMALL LETTER O) and U+0308 (COMBINING DIAERESIS).

Furthermore, you shouldn't assume that there is any relationship between how unicode constructs a combined character from codepoints with how that character is typed, even at the level of typing you're not typing unicode codepoints - they're just a technical standard representation of "text at rest", unicode codepoints do not define an input method. Depending on your language and device, a sequence of three or more keystrokes may be used to get a single codepoint, or a dedicated key on keyboard or a virtual button may spawn a combined character of multiple codepoints as a single unit; you definitely can't assume that the "last codepoint" corresponds to "last user action" even if you're writing a text editor - much of that can happen before your editor receives that input from e.g. OS keyboard layout code; your editor won't know whether I input that ö from a dedicated key, a 'chord' of 'o' key with a modifier, or a sequence of two keystrokes (and if so, whether 'o' was the first keystroke or the second, opposite of how the unicode codepoints are ordered).
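
To make the two encodings of ö concrete, here is a short sketch using Python's standard library. The `same_text` helper is a hypothetical name; it compares strings after NFC normalization, which is one way (not the only one) to treat the two forms as equal.

```python
import unicodedata

precomposed = "\u00F6"  # LATIN SMALL LETTER O WITH DIAERESIS
decomposed = "o\u0308"  # LATIN SMALL LETTER O + COMBINING DIAERESIS

def same_text(a: str, b: str) -> bool:
    """Compare two strings up to canonical equivalence."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(len(precomposed), len(decomposed))   # code point counts differ
print(precomposed == decomposed)           # raw comparison sees two strings
print(same_text(precomposed, decomposed))  # canonically equivalent
```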


> I'd argue that you must use grapheme clusters for text editing and cursor position

Korean packs syllables into Han-script-like squares, but they are unmistakably composed of alphabetic letters, and are both typed and erased that way (the latter may depend on system configuration), yet the NFC form has only a single codepoint per syllable (a fortiori a single grapheme cluster). Hebrew vowel markings are (reasonably) considered to be part of the grapheme cluster including their carrier letter but are nevertheless erased and deleted separately. In both of those cases, pressing backspace will erase less than pressing shift-left, backspace; that is, cursor movement and backspace boundaries are different.

There are IIRC also scripts that will have a vowel both pronounced and encoded in the codepoint stream after the syllable-initial consonant but written before it; and ones where some parts of a syllable will enclose it. I don’t even want to think how cursor movement works there.

Overall, your suggestion will work for Latin, Cyrillic, Greek, and maybe other nonfancy scripts like Armenian, Ge’ez, or Georgian, but will absolutely crash and burn when used for others.


OK, I understand that the initial sentence is too strict, however, using codepoints for text editing and cursor position is even worse - even in your example of Korean there's a clear distinction depending on how the same character is encoded (combined NFC or not), but it should be the same to the user; and obviously if someone inputs a latin-diacritic character by pressing a modifier key before the base letter, then backspace removing the diacritic (since unicode modifiers are after the base letter) would be just ridiculous.

Backspace in general seems to be a very difficult problem because of subtly incompatible expectations depending on the context, as 'undo last input' when you're typing new text, and 'delete previous symbol' if you're editing existing text.


Some platforms (e.g., Android) have methods specifically for asking how to edit a string following a backspace. However, there's no standard Unicode algorithm to answer the question (and I strongly suspect that it's something that's actually locale-dependent to a degree).

On further reflection, probably the best starting point for string editing on backspace is to operate on codepoints, not grapheme clusters. For most written languages, the various elements that make up a character are likely to be separate codepoints. In Latin text, diacritics are generally precomposed (I mean, you can have a + diacritic as opposed to precomposed ä in theory, but the IME system is going to spit out ä anyways, even if dead keys are used). But if you have Indic characters or Hangul, the grapheme cluster algorithm is going to erroneously combine multiple characters into a single unit. The issue is that the biggest false positive for a codepoint-based algorithm is emoji, and if you're a monolingual speaker whose only exposure to complex written scripts is Unicode emoji, you're going to incorrectly generalize it for all written languages.


IMHO backspace is not an undo key. Use CTRL+Z if you want to undo converting a grapheme to another grapheme with a diacritic character. Backspace should just delete it.

On the other hand, a ligature shouldn't be deleted entirely with just one backspace. It's two letters after all, just connected.

So how do we distinguish when codepoints are separate graphemes, and when they constitute a single grapheme? Based on if they can still be recognized as separate within the glyph? If they combine horizontally vs vertically (along the text direction or orthogonal?) What about e.g. "¼" - are those 3 graphemes? What about "%" and "‰"? What about "&" ("et" ligature)? It seems you can't run away from being arbitrary…


> Or typing 'o' + '¨' to produce 'ö' but realizing you wanted to type 'ô', there just one backspace press would revert it to 'o' again and you could press '^' to get the 'ô'.

This is a good example because in German I would expect 'o' + '¨' + <delete> to leave no character at all while in French I would expect 'e' + '`' + <delete> to leave the e behind because in my mind it was a typo.

The rendering of brahmic- and arabic-derived scripts makes these choices even more interesting.


But typing "ö" (e.g. swiss keyboard) and pressing delete & getting an o would be annoying af


Same for a French keyboard with éèàù which are all typed using one key. But even êôûæœäëïöü, all typed using at least two keys, if not 3 with a compose key (from memory, I'm using a phone). Everybody is used to the way it has been working on all OSes.


I realize that the editor would be the system to keep track of how the character was entered for this to work. If you made the character from a single keypress it would only make sense that backspace also undid the entire character. Only if you created the character from multiple keypresses it would make sense to "undo" only part of it with backspace (at least until you move away from the character).


> make sense to "undo" only part of it with backspace

I'm not sure that ever really makes sense: it would be a misnomer if "backspace" didn't bring you "back" some amount of horizontal "space," I reckon. This logic holds up not only for cases like ö and emoji (where I'd hope the whole grapheme disappears), but also for scenarios like if one types <f><i> and an <fi> ligature appears, where I'd hope only the <i> disappears: that's fine because you are still going back some space.

If the key ever gets repurposed from "backspace" to "undo" then I would agree that it should step back to the previous state with as much granularity as possible.


“Backspace” is already a misnomer: the original intention was for it to move the caret one position back, the other way around from the usual space, thus enabling overstriking (whence also the characters _ ^ ` , unheard of before the typewriter age). You can still see this used by the troff|less internals of man, which encode underline and bold as _<BS>a and a<BS>a respectively.
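
A minimal sketch of handling that overstrike encoding, assuming input that only uses the char-backspace-char pattern described above: `strip_overstrikes` (a hypothetical helper) drops every character that is immediately followed by a backspace, which leaves the plain text.

```python
import re

def strip_overstrikes(text: str) -> str:
    """Remove typewriter-style overstrikes (any character followed by BS)."""
    return re.sub(r".\x08", "", text)

underlined = "_\x08h_\x08i"  # underline encoded as _<BS>h _<BS>i
bold = "h\x08hi\x08i"        # bold encoded as h<BS>h i<BS>i

print(strip_overstrikes(underlined))
print(strip_overstrikes(bold))
```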


The text already contains the data about how the character was entered (more specifically, what codepoints it's made of), unless it has been normalized somehow. It seems like a tarpit to me to properly specify these behaviours in a way that won't be an annoyance/surprise to a lot of people.


It would probably expose the implementation too much, but if you wanted combined characters to be split apart, I would expect "ö" to be removed in one go, whereas an "o" with a combined diaeresis (No idea if it's called that, I got the name from the code chart) would take two backspaces.


The mark with two dots can refer to various things.

Dieresis (Greek word we use in English) is a phenomenon where two vowels don’t merge together to make a new sound. Co-operate, coöperate, or cooperate. You could also shove a glottal stop in there but this is much less extreme. The dieresis marker isn’t used much in the USA. Another common example is naïve.

Umlaut (German word) is the opposite phenomenon: the first vowel sound is inflected by the following, so in German Apfel (apple, pronounced UPfell) becomes Aepfel (apples, EPPfell), commonly written Äpfel instead.

People usually refer to the marks by the term in their own language (umlaut, dieresis, caron, accent…). I say Umlaut in German (which doesn’t mark the phenomenon, you just have to know) and dieresis in English though I’m talking about the same symbol.

But diacritics are used in no systematic way in different languages anyway (this is true of the letters themselves too: letters like S, W, V and Z have different pronunciations in English and German, for example). Ppl just needed ways to say “the sound here is similar to the sound of this letter but not really the same” or “when you say this letter also do something else” (like, say, add stress for languages that have that).

English mostly dispenses with them, sometimes adding a letter (g vs gh) but mostly having a very casual relationship with the sound at all. You simply have to know that cooperate isn’t pronounced like chicken coop. I personally think that’s a good thing.


If you want to refer to the symbol in a neutral way there is the term trema


Pragmatically, to get from "ö" to "o" one would delete "ö" (hopefully by pressing backspace) and then type "o"; deleting the combining marks isn't actually useful.


Definitely agree with that! I use a US kbd (incl on phone) no matter what language I’m writing in. A little annoying but switching kbd layouts is more disruptive for me.


In French, è is a single character issued by a single keypress on a French keyboard, like e, or +. (Note that A is shift+a). Why should it need two backspaces? If you press e+` well you have e`, not è.


I am assuming that means "on French keyboard", not "in French". I have a usa keyboard and live in Canada...Every now and then it thinks I'm typing French and keyboard indeed behaves in a way that some vowel plus some quotation mark indeed gives me some other character (that I don't need :)


That would feel very strange to me. The Canadian layout probably behaves differently, but à is a modifier letter, not two letters. I’d expect backspace to remove the whole letter, including the accent.


Oh, it's strange to me too! I curse every time I accidentally hit that setting and my computer seems possessed :)

It appears there's a few keyboard layouts and language options:

French (Canada) - Canadian French

French (Canada) - Canadian Multilingual Standard

French (France)

etc

And then I think it intersects with what keyboard you actually have.

Some of these layouts maybe are designed to enable you to type in French characters on a non-French keyboard? Not sure.


Behavior that depends on whether you edited something else in between, or that depends on timing, is just bad. Either always backspace grapheme clusters, or else backspace characters, possibly NFC-normalized. I could also imagine having something like Shift+Backspace to backspace NFKD-normalized characters when normal Backspace deletes grapheme clusters.

As for selection and cursor movement, grapheme clusters would seem to be the correct choice. Same for Delete. An editor may also support an “exploded” view of separate characters (like WordPerfect Reveal Codes) where you manipulate individual characters.
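
The code-point-based options above can be sketched with the standard library; grapheme-cluster deletion is left out because Python's stdlib has no segmenter. All three function names are hypothetical, and `backspace_decomposed` is just one reading of "backspace NFKD-normalized characters": it decomposes only the last code point, drops its final piece, and recomposes the rest.

```python
import unicodedata

def backspace_code_point(text: str) -> str:
    """Delete the last code point as stored."""
    return text[:-1]

def backspace_nfc(text: str) -> str:
    """Normalize to NFC first, then delete the last code point.

    Note that this also normalizes the text before the cursor.
    """
    return unicodedata.normalize("NFC", text)[:-1]

def backspace_decomposed(text: str) -> str:
    """Delete only the last piece of the last code point's NFKD form."""
    if not text:
        return text
    head, last = text[:-1], text[-1]
    pieces = unicodedata.normalize("NFKD", last)
    return head + unicodedata.normalize("NFC", pieces[:-1])

print(backspace_decomposed("\u00F6"))  # precomposed o with diaeresis
print(backspace_decomposed("\uFB00"))  # ff ligature
print(backspace_decomposed("\uAC01"))  # a precomposed Hangul syllable
```

Which of these feels right is exactly the locale- and context-dependent question the rest of the thread is arguing about.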


Everybody loves to debate what "character" means but nobody ever consults the standard. In the Unicode Standard a "character" is an abstract unit of textual data identified by a code point. The standard never refers to graphemes as "characters" but rather as "user-perceived characters", a distinction the article omits.


I'm not Korean but seeing that said of the Hangul example definitely made me pause - I doubt Koreans think of that example as a single grapheme (open to correction), though it is an excellent example all the same since it demonstrates the complexity of defining "units" consistently across language.

It reminds me a little of Open Street Map's inconsistent administrative hierarchies ("states", "countries", "counties", etc. being represented at different administrative "levels" in their hierarchy for each geographical area), and how that hinders consistency in styling- font size, zoom levels, etc. being generally applied by level.


As a native Korean, I can confirm that "각" is perceived as a single character. But the example itself is bad anyway because everyone uses the precomposed form U+AC01 instead of U+1100 U+1161 U+11A8 (they are canonically equivalent). This is clearer when you also consider a compatibility representation "ㄱㅏㄱ" U+3131 U+314F U+3131, which compatibility normalizations (NFKC or NFKD) map to conjoining jamo, much like the decomposed form of "각", but which is perceived as three atomic characters in general.
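
The canonical equivalence is easy to check with Python's standard library. This sketch only verifies the NFC/NFD round trip; for the compatibility jamo it prints what NFKC produces rather than assuming a result.

```python
import unicodedata

precomposed = "\uAC01"          # the precomposed syllable
jamo = "\u1100\u1161\u11A8"     # conjoining jamo sequence
compat = "\u3131\u314F\u3131"   # compatibility jamo letters

# Canonical equivalence: NFC composes the jamo, NFD splits the syllable.
print(unicodedata.normalize("NFC", jamo) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == jamo)

# Canonical normalization leaves the compatibility letters untouched.
print(unicodedata.normalize("NFC", compat) == compat)

# Inspect what compatibility normalization does to them.
print([f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFKC", compat)])
```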


Thanks for the clarification, that's interesting.

My impression before was it would be considered a single "entity" (not the same as a roman-alphabetic word, but not a character either) containing 3 characters.


In that case, it sounds like `length` on Unicode strings simply shouldn't exist, since there is no obvious right answer for it. Instead there should be `codepointCount`, `graphemeCount`, etc.
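
A sketch of what such explicitly named counts could look like, using hypothetical function names. A `graphemeCount` equivalent is omitted because it needs a segmentation library outside Python's standard library. The sample string is the facepalm emoji sequence discussed in hsivonen's article, written as escapes.

```python
def code_point_count(text: str) -> int:
    """Number of Unicode code points (what Python's len() reports)."""
    return len(text)

def utf8_byte_count(text: str) -> int:
    """Number of bytes in the UTF-8 encoding."""
    return len(text.encode("utf-8"))

def utf16_unit_count(text: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding."""
    return len(text.encode("utf-16-le")) // 2

# Face palm + skin tone modifier + zero width joiner + male sign
# + variation selector-16: one visible emoji.
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

print(code_point_count(facepalm))
print(utf8_byte_count(facepalm))
print(utf16_unit_count(facepalm))
```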


There are basically 2 places where programmers mostly want the "length" of a string:

1. To save storage space or avoid pathological input, they want to limit the "length" of text input fields. E.g., not allow a name to be 4 KB long

2. To fit something on screen

Developers mostly used to western languages can approximate both with "number of letters", but the correct answers are

For 1. Limit by bytes to avoid people building infinite zalgo characters, but be intelligent about it: don't just crop the byte array without taking graphemes into account.

For 2. This sucks, especially for the web, but the only really correct answer here is to render it and check.

Did I miss any other common cases?
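
For case 1, a rough sketch of byte-limited truncation that avoids the worst failure modes. `truncate_utf8` is a hypothetical helper, and it is only an approximation of grapheme-aware cropping: it never splits a code point and never leaves a base letter stripped of its combining marks, but it knows nothing about emoji sequences or Hangul jamo. A real implementation would use a proper grapheme segmenter.

```python
import unicodedata

def truncate_utf8(text: str, max_bytes: int) -> str:
    """Truncate text so its UTF-8 encoding fits in max_bytes."""
    used = 0
    cut = len(text)
    for i, ch in enumerate(text):
        size = len(ch.encode("utf-8"))
        if used + size > max_bytes:
            cut = i
            break
        used += size
    else:
        return text  # everything fits

    # If the first dropped character is a combining mark, the cut fell
    # inside a base+marks sequence: back up to drop the whole sequence.
    while cut > 0 and unicodedata.combining(text[cut]) != 0:
        cut -= 1
    return text[:cut]

print(truncate_utf8("ao\u0308", 3))  # the "o" would lose its diaeresis
```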


Presizing buffers, initializing for loop counts ...


Unrelated to this post, but you suggested (https://news.ycombinator.com/item?id=37381390) I use your company hydraulic.dev for my electron build. I ultimately gave up on it, then someone from node-gyp commented on an issue I opened about it and provided the solution:

https://github.com/electron-userland/electron-builder/issues...

Just wanted to let you know in case it's a gotcha you might not be aware of that might help you out if you run into similar problems with some of your customer builds.


Thanks for the tip! Sorry to hear you gave up on it, it'd be really appreciated if you could email me with some info as to what problems you hit. We're improving our Electron support at the moment (adding ASAR support and so on), so if there's low hanging fruit it'd be good to know where to look.

The bug you linked to is a bit confusing, it seems to be a bug in electron-builder (or node-gyp), not Conveyor?


Buffers should just be working on raw byte arrays without even considering the content (if it's a string or data or whatever).

"for loop counts" depends on what you're doing in the loop...


You're absolutely correct! `length` is ambiguous - you shouldn't have a `time` argument in a `sleep` function; you should have `milliseconds` and/or `seconds` etc.


You could have a Duration argument though.

The parallels of string length with the phrase "How long is a piece of string?"[0] make this apparent/amusing. I'm sure I'm not the first person to think that.

[0]: https://en.wiktionary.org/wiki/how_long_is_a_piece_of_string
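
As a sketch of the Duration idea in Python terms, assuming `datetime.timedelta` plays the role of the duration type: the unit lives in the value, so the parameter name no longer has to carry it. `sleep_for` is a hypothetical wrapper, not a standard function.

```python
import time
from datetime import timedelta

def sleep_for(duration: timedelta) -> None:
    """Sleep for the given duration; the caller picks the unit explicitly."""
    time.sleep(duration.total_seconds())

# The call site states its unit instead of passing a bare number.
sleep_for(timedelta(milliseconds=5))
```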


The `Duration` being a type that implements an interface removing the ambiguity? Like a `DateTime` object does? I think it might be useful to have a function returning a collection of information about text, how many unicode points, how many grapheme clusters, how many syllables, vowels, consonants, special characters… But for performance reasons you probably want separate functions that give you just one of these.


DateTime, Instant, and Duration are totally different things. I believe I was thinking of a class I remember seeing added to java at one point. Time is probably one of the few things in programming that's even more cursed than strings.


String iteration should be based on whatever you want to iterate on - bytes, codepoints, grapheme clusters, words or paragraphs. There's no reason to privilege any one of these, and Swift doesn't do this.

"Length" is a meaningless query because of this, but you might want to default to whatever approximates width in a UI label, so that's grapheme clusters. Using codepoints mostly means you wish you were doing bytes.


> There's no reason to privilege any one of these, and Swift doesn't do this.

Strange thing to say: Swift String count property is the count of extended grapheme clusters. The documentation is explicit:

> A string is a collection of extended grapheme clusters, which approximate human-readable characters. [emphasis in original]


The length/count property was added after people asked for it, but it wasn't originally in the String revamp, and it provides iterators for all of the above. .count also only claims to be O(n) to discourage using it.


That was almost seven years ago now. It has been the String API twice as long as it has not been the API.


> thing that gets deleted when you hit backspace

Is there a canonical source for this part, by the way? Xi copied the logic from Android[1] (per the issue you linked downthread), which is reasonable given its heritage but seems suboptimal generally, and I vaguely remember that CLDR had something to say about this too, but I don’t know if there’s any sort of consensus on this problem that’s actually written down anywhere.

[1] https://github.com/xi-editor/xi-editor/pull/837


> And for this reason, String iteration should be based on codepoints

Why not offer both and be clear about it? Rather than just "length", why not call them code points? The Python docs for "len" which can be called on a unicode string say "Return the length (the number of items) of an object.". It doesn't look like a clear and easy to use API to me.


If you insist that `len` shouldn't be defined on strings, and the default iterator should be undefined in python then:

  for c in "Hello":
    pass
should throw an exception. Also

  if word[0] == 'H':
     pass
should throw an exception.

This would have been an extremely controversial suggestion when python3 came out to say the least.

Codepoints is a natural way of defining unicode strings in python, and it mostly works the way you expect once you give it a bit of thought. It is lower level than, say, grapheme clusters, but it's better defined and it provides the proper primitives for dealing with all use cases.
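
For illustration, this is how the code-point view behaves in Python 3. It is only a small sketch, and the flag string is just a convenient example of something that displays as one unit but is made of two code points.

```python
import unicodedata

word = "Hello"
assert len(word) == 5                # len() counts code points
assert word[0] == "H"                # indexing selects a code point
assert list(word) == ["H", "e", "l", "l", "o"]

# The Norwegian flag is two regional-indicator code points.
flag = "\U0001F1F3\U0001F1F4"
assert len(flag) == 2
assert unicodedata.name(flag[0]) == "REGIONAL INDICATOR SYMBOL LETTER N"
```

Indexing into the flag gives a lone regional indicator, which is well defined but not what a reader would call a character.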


I would suggest that len works as the article suggests; and "Hello".codepoints gives the behaviour you want.


> for this reason, String iteration should be based on codepoints

This leads you to the problem where you'll get different results iterating over

n a ï v e

vs

n a i ̈ v e

And I can't see how that's ever going to be a useful outcome.

If you normalize everything first, then you can sidestep this to some degree, but then in effect your normalization has turned codepoint iteration into grapheme iteration for most common Latin-script text characters.
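
To make the difference concrete, here is a short Python sketch using the standard `unicodedata` module. The two spellings of "naïve" are written with escapes so the code points are unambiguous.

```python
import unicodedata

composed = "na\u00efve"       # ï as the single code point U+00EF
decomposed = "nai\u0308ve"    # i followed by U+0308 COMBINING DIAERESIS

# Same text on screen, different code point sequences.
assert composed != decomposed
assert len(composed) == 5
assert len(decomposed) == 6

# Normalizing both to the same form makes them compare equal.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
```

After NFC the two strings iterate identically, but only because this particular base-plus-mark pair has a precomposed code point. Sequences without one still come out as several code points.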


This is quite a good write up. An answer to one of the author's questions:

> Why does the fi ligature even have its own code point? No idea.

One of the principles of Unicode is round trip compatibility. That is, you should be able to read in a file encoded with some obsolete coding system and write it out again properly. Maybe frob it a bit with your unicode-based tools first. This is a good principle, though less useful today.

So the fi ligature was in a legacy encoding system and thus must be in Unicode. That's also why things like digits with a circle around them exist: they were in some old Japanese character set. Nowadays we might compose them with some zwj or even just leave them to some higher level formatting (my preference).
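
A small Python sketch of how these compatibility characters behave. The choice of `cp932` (a Shift_JIS variant) as the legacy encoding is my own example, not something named above.

```python
import unicodedata

ligature = "\ufb01"   # U+FB01 LATIN SMALL LIGATURE FI
circled = "\u2460"    # U+2460 CIRCLED DIGIT ONE

# Canonical normalization leaves them alone...
assert unicodedata.normalize("NFC", ligature) == ligature

# ...while compatibility normalization folds them into plain characters.
assert unicodedata.normalize("NFKC", ligature) == "fi"
assert unicodedata.normalize("NFKC", circled) == "1"

# The character database records them as compatibility decompositions.
assert unicodedata.decomposition(ligature) == "<compat> 0066 0069"
assert unicodedata.decomposition(circled) == "<circle> 0031"

# Round trip through a legacy Japanese encoding survives intact.
assert circled.encode("cp932").decode("cp932") == circled
```

The last line is the round-trip property: because the circled digit has its own code point, nothing is lost going to the legacy encoding and back.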


> they were in some old Japanese character set

This implies that they're obsolete, but they're not -- they're still in very common use today. You can type them in Japanese by typing まる (maru, circle) and the number, then pick it out of the IME menu. Some IMEs will bring them up if you just type the number and go to the menu, too. :)


Fair enough. I was thinking of them as obsolete, but shouldn't since you do see them a surprising amount in Japan.


What do the Japanese use the circled numbers for?


Ordinals and references, like this:

① Draw some circles

② Draw the rest of the owl

Commentary: ① is simple but ② is masking many complex steps that are necessary to draw an owl.


> So the fi ligature was in a legacy encoding system and thus must be in Unicode.

Most of the pre-composed latin ligatures are generally from EBCDIC codepages. People in the ancient Mainframe era wanted nice typesetting too, but computer fonts with ligature support were a much later invention.

You can see fi and several others directly in EBCDIC code page 361:

https://en.wikibooks.org/wiki/Character_Encodings/Code_Table...


Thanks. Some alphabets have precomposed ligatures that aren't really letters, like old German alphabets with tz, ch, ss (I only know how to type the last one, ß, because the others have died out over the last hundred years).

Actually in German (at least) ä, ö and ü really are actually ligatures for ae, oe, and ue -- the scribes started to write the E's on their sides above the base letters, and over time the superscript "E"s became dots or dashes. Often they are described the other way around: "you can type oe if you can't type ö." That's what my kid was told in school!

But Ö and ß aren't really part of the alphabet in German, while, say, in Swedish, ä and ö became actual letters of the alphabet. English got W that way too.


That sounds a bit false to me. The Umlaute (ä, ö, ü) and the "eszett" ß are actually part of the German alphabet[1]. Also it is kinda weird to describe them as ligatures of the original letters and the diaeresis, because while this is what they started out as a long time ago, they are just their own letters now (as opposed to "real" stylistic ligatures like combining fi into one glyph). The advice your kid was told that they can be replaced with ae, oe and ue is correct - it is a replacement nowadays.

[1] https://de.wikipedia.org/wiki/Deutsches_Alphabet


C'mon that page is highly technical, really just listing the letters or glyphs used for forming printed text. In reality, if you walk into any first grade classroom you see pictures of the letters A-Z with pictures (Apfel, Bär, usw.) and then after the end maybe around the corner, what, Öl? I can't even remember. When the kids recite the letters they don't recite äöüß. TBH I really only remember this because when kiddo was that age the neue Rechtschreibung transition was mid-process and the parents were angrily divided so I was kinda paying attention.

Also, though it's hardly authoritative, my kids' school taught English through immersion from grade 1 too, and both German and English teachers said "same alphabet".

As bmicraft pointed out, even in that wikipedia chart those inflected letters are spaced apart from the others. Yes, they are letterforms, but not part of the "alphabet" -- they don't even have a sorting like the Swedish Ä or W do.

And you can switch in running text from using the marker for umlaut (dots or bar, not semantically dieresis) or a normal "e" without anyone blinking. There's no problem reading a Swiss book even though ß refuses to cross the border. Though I personally prefer to read Äpfel and Bär rather than Aepfel and Baer, really, they are the same.


> When the kids recite the letters they don't recite äöüß.

On a somewhat related side note, I read that "&", which is derived from a ligature of the Latin conjunction "Et" (meaning "and"), was named "ampersand" in English as a mondegreen for "and per se and" as it used to be placed at the end of the English alphabet recitation.


They sure are letters, but they aren't generally thought of as being in the alphabet (which seems to be why they are just kinda tacked on after a space on wikipedia) and get ordered as if they were just the base letter (mostly)


Note that in Swedish they are considered letters, and in Danish and Norwegian Æ, Ø and Å are letters.


Letters which are sorted separately from what we'd think of as the base characters in English (they appear at the end of the alphabet, as W X Y Z Æ Ø Å, with C often omitted in Norwegian).

By contrast, my French dictionary has énorme nestled between enorgueillir and enquiérir. (Looking for an example does underscore some of the patterns in the language: page after page of ét~ with only a few et~ and one êt~ among them; pages of ex~ with no éx~ at all.)
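
A toy Python sketch of the two conventions. Real collation follows CLDR/ICU rules and is far more involved. The helper names `norwegian_key` and `french_key` are made up, and the French key simply ignores accents instead of using them as a tie-breaker.

```python
import unicodedata

# Danish/Norwegian order: æ, ø, å are separate letters after z.
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e6\u00f8\u00e5"
_RANK = {letter: index for index, letter in enumerate(NORWEGIAN_ALPHABET)}


def norwegian_key(word):
    """Rank each letter by its position in the alphabet above."""
    text = unicodedata.normalize("NFC", word.lower())
    return tuple(_RANK.get(ch, len(_RANK)) for ch in text)


def french_key(word):
    """Compare base letters only, so é files alongside e."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


norwegian = sorted(["ål", "zebra", "ære", "apple", "øl"], key=norwegian_key)
french = sorted(["enquiérir", "énorme", "enorgueillir"], key=french_key)
print(norwegian)
print(french)
```

A plain `sorted()` without a key would compare raw code points and put "énorme" after every word starting with an unaccented "e".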


Similarly in Swedish, W was not considered a letter but just a variant of V, so in phone books etc all the W names were mixed in with V names. This was changed in 2006 due to an increase in English loanwords.


> old German alphabets with tz, ch, ss (I only know how to type the last one, ß, because the others have died out over the last hundred years)

They still use ꜩ on some German street signs. I can't find ch in Unicode though (could just be my old eyes).


You sent me on an enjoyable wild goose chase but it appears that only ß and ff are in unicode: tz, ch, ck have to be handled completely in rendering.

I have my music teacher's German schoolbook from around 1915 and it lists them all as letters (the whole book is in Fraktur). I have various old books in Fraktur and once could read them. I imagine that if I sat down and tried to read one it would come back, but at the moment I have to think a little to read just the titles!


The circled digits as code points are very nice to have precisely because they are available in applications that don't support them otherwise... which is actually most of the software I can think of (Notepad, Apple Notes, chat applications, most websites, etc).


My point was that, had they not been legacy characters (or had RT compatibility been disregarded) Unicode could still have supported them as composed characters. Though I personally still feel they are a kind of ligature or graphic, but luckily for everyone else I’m not the dictator of the world :-).

We should be careful: someone on HN could write a proposal that they should be considered pre composed forms that should also have an un-composed sequence… so there could in future be not just 1 in a circle but 1 ZWJ circle, circle ZWJ 1 all considered the same…I can imagine some HN readers being pranksters like that.


Can you write them with iOS keyboard? Or when you say Apple Notes and chat apps you just mean from desktop?

Edit ①: seems the answer is not with the default iOS keyboard, but possible to paste it and perhaps possible with a third party keyboard that I'm not keen on trying (unless I hear of a keyboard that's both genuinely useful / better than default, and that doesn't send keystrokes to the developer - though I can't remember if the latter is even a risk on iOS, better go search about that next..)


Speaking of third party keyboards, I’m still upset about what happened to Nintype[0]. I’ve never ever been able to type faster on mobile than with its intuitive hybrid input style of sliding and tapping, paired with AI that was actually good. It used to be quite performant, fully customizable, and it worked beautifully as a replacement for default on jailbroken iOS.

Today, it’s buggy $5 abandonware that only makes me sad when I am reminded of it.

EDIT: Here[1] is a blog post that claims it's still the best keyboard in 2023. I actually might give it another shot... Not holding my breath though.

*EDIT: Looks like another dedicated fan has actually taken it upon themself to revive the project, under the new name Keyboard71[2].

[0] https://apps.apple.com/us/app/nintype/id796959534

[1] https://maxleiter.com/blog/nintype

[2] https://www.reddit.com/r/keyboard71/


I just want to say thank you for introducing me to Keyboard 71. I've never heard of Nintype, but this thing is incredible!


I wonder why they haven't open sourced their fork, other than a vague worry it might get DMCA'd


Nintype was absolutely incredible. I still open it every now and then after an iOS update in the vain hope some system change made it less buggy.


I’m really considering repurchasing (I definitely owned it previously, no idea what happened), can you describe specifically what the main bugs are for you? I’d be happy if I could use it solely for occasionally writing long notes, not as a replacement for all text inputs.

Really not looking to burn another $5, I’d greatly appreciate any thoughts/concerns at all.


You can type ① with the UniChar keyboard app on iOS. It at least claims it doesn’t transmit information. As it’s only useful for special characters I don’t worry because I can’t use it for normal typing anyway.

https://unichar.app


No third party keyboard transmits information without you permitting it.


The default iOS Japanese keyboard allows easily entering circled numerals and many other “exotic” characters.


Yeah many "exotic" characters were introduced to Unicode from Japanese legacy computers 〠〄


If you use the iOS Japanese romaji keyboard, typing "maruichi" will give you all the options.


You can copy/paste them from a character board, a dedicated website, or even the wiki.


> That's also why things like digits with a circle around them exist: they were in some old Japanese character set.

Replace "digits with a circle around them" with "emojis" and that's also true.


Is it just me, or is anyone else seeing what looks like the mouse pointer of everyone else reading the page, like 1,000 little ants on the screen


Anytime tonsky's site gets posted here, I'm reminded of how awful it is, which is ironic given his UI/UX background. The site's lightmode is a blinding saturated yellow, and if you switch into darkmode, it's an even less readable "cute" flashlight js trick. I don't know why he thought this was a good idea. Thank god for Firefox reader mode.


I don't think he added moving cursors all over the page because he thought it was good UI/UX, he knows what he is doing.


I'm having a hard time reconciling "he knows what he is doing" with him making his site practically unusable without a reader mode, which by the way, not every browser supports (especially on mobile).


Don't even think of switching on the dark (night) mode with that attitude! :D

I really enjoyed the tongue in cheek design. I think every modern browser either allows you to turn on reader mode (especially on mobile) or just turn off CSS. This particular article works excellently even in w3m.


Fine, sure. Cute - turn on reader mode. Now the images that are supposed to be sitting over yellow background are dark gray images over a slightly less dark gray background.

The decision to design a serious (read: not-tongue-in-cheek) topic with these "quirky" tricks sucks, JMHO.


> Don't even think of switching on the dark (night) mode with that attitude! :D

:D

> I really enjoyed the tongue in cheek design. I think every modern browser either allows you to turn on reader mode (especially on mobile) or just turn off CSS. This particular article works excellently even in w3m.

Firefox Focus on mobile does not have Reader Mode.


Fair enough. Tbh at a different time I would probably be as pissed at the author as anyone. Might be mood dependent.

I should also add that the night mode won't be as fun on mobile either, just checked and I don't think it works with pointer events for the effect


He appears (if his logos are anything to go by) to be a flat UI guy. I doubt any of these people know what they're doing.


This is seemingly self-contradictory. Perhaps you could explain your reasoning further?


Doing bad things is their idea of fun.


You gotta know the rules to bend the rules


It lets you hold hands with strangers


It's called satire.


It's deeply ironic that an article about dealing with text properly has images which are part of the article text and yet have no alt-text, rendering parts of the article unreadable in reader mode if the server is slow.


It is obviously a joke (and a good one, I dare say). The fact that people seem to take it seriously says something about the contemporary state of webdesign :)


It would be a better joke if there were an option to turn the joke off. As it is, dark mode doesn't exist and the pointers occlude text.


> It would be a better joke if there were an option to turn the joke off.

As others have pointed out, reader mode works as expected.


1. Not every browser has reader mode

2. I don't think it's a very good joke to post long-form content on your blog with the expectation that it's basically unreadable without a reader mode.

> The fact that people seem to take it seriously says something about the contemporary state of webdesign :)

Mind expanding on what it says exactly about contemporary web design?

Whether I take it seriously or not doesn't change the fact that it's still damn hard to read anything.


> Mind expanding on what it says exactly about contemporary web design?

The same as when political satire is indistinguishable from actual politics. It means that the real thing has sort of become a joke itself.


I can agree that most modern web design is bad. I can also agree that the web design on tonsky's site is bad, but OK, I acknowledge that it is intentionally bad; so bad that it's unreadable. I had myself a chuckle now, and next time I see a link to tonsky's site, I'll click on it, chuckle, and immediately leave.


If you don't have reader mode, get a new browser. Don't tell him to make it boring for all the rest of us who behave normally.


Boring and readable are not the same thing. Also, you can edit your comments on HackerNews


Also, I read it perfectly well and never thought about switching to reader. It made me laugh.


Works like a normal website with JavaScript disabled. I didn't even know it did fancy junk until reading the comments here. NoScript saves the day again! I don't know how people can browse the web without it.


I never understood how people can browse the web WITH IT. Even 10 years ago, and today more than ever, basically every website needs JS to work properly. I basically never come across a page where I have the urge to disable JS. I have a large list of adblock lists active that also help getting rid of cookie banners and other shit.

I can not imagine manually approving JS for every site. And again, doing the inverse and having NoScript installed to deny one website a year does not seem to be worth it for me. In this case I can also just use an adblock rule to block that specific script or all .js files from the domain, I guess. So I really do not need NoScript.


Many sites don't need JS at all, like the OP of this thread for example. For a lot of sites, disabling JS actually gives a better experience than leaving it enabled, again like the OP of this thread. It's a trade-off, but I find most uses of JS are so bad it's worth putting up with whitelisting. For example, I don't see cookie pop-ups, I don't see videos, disabling JS kills most of those stupid sticky headers that web designers love so much, and whatever too-clever crap the OP of this thread was doing is completely bypassed. The web is so, so much better with JS off by default.

For those sites that do need JS, NoScript's whitelist feature makes it quick & easy to fix. The first time I visit a new site, if it is obviously broken, then I whitelist the main domain. If that doesn't fix it, then I whitelist a couple likely-looking domains (often sites import JS from similar domains, or from common library domains). That's enough to get probably 90+% of websites working, while still leaving most garbage JS disabled. The remaining ~10% of websites that need a dozen domains whitelisted are probably not worth visiting anyway, so I just move on at that point. Or NoScript even lets you temp-whitelist everything for a given tab and just put up with the misery to get whatever I need from that one site. Since the whitelist persists forever, and I don't visit hundreds of different websites every day, after some time it becomes pretty rare that I need to whitelist more than one or two things per day.

You maintain an adblock blacklist, I maintain a NoScript whitelist. Not so different :)


By default the script for the page itself is whitelisted, it is just the third party scripts that are blocked. This works fairly often, but there are a few sites that you can also globally unblock because they provide value. One example is mathjax, used to format equations on many pages.


It is some time ago since I last used it, but I found that too many websites that I want to read require Javascript to even show you the main body of text, or a reasonable layout. Is that different now?


No. Just whitelist the main domain for sites that are obviously broken. Then try one or two likely subdomains if that's not enough. In the rare cases where it still wants more crap enabled, then it's usually not worth the effort, close tab and move on to something else. As you build up a whitelist over time, it becomes pretty rare that you need to interact with it more than a couple times per day. Yeah, it takes some effort, but it's worth it to nuke cookie banners and sticky headers and videos and all the other crap people do with JS.


I already have that routine with uBlock Origin. I don't think NoScript offers all of uBO's functionality, and I certainly won't do the same dance for two extensions, but I'll look into uBO's abilities to specifically block JS.


Makes sense! I use both, uBO just does its thing and I never interact with it. NoScript handles blocking & whitelisting javascript. It's totally possible uBO has a similar feature and I just don't know about it.


By using reader mode


>it's an even less readable "cute" flashlight js trick. I don't know why he thought this was a good idea. Thank god for Firefox reader mode.

not even a proper flashlight. it updates when the mouse moves, so you're SOL if you scroll on desktop.


Well, I thought it was fun.


I'd say this annoying trick is highly appropriate for the topic!


Yep, the website opens a websocket connection[0] and sends the mouse position every 1 second

[0] WS connection is on `wss://tonsky.me/pointers?id=XXXXXX&page=/blog/unicode/&platform=XXX`


I've been drawing circles for over a minute now and no one has joined me yet, so I conclude those movements are random rather than made by intelligent beings. :)


I did the same for a while while I was reading. From another comment, the position only seems to update once a second, so it'll be hard for someone to notice your movements.


That makes me think of this old gem https://imgur.com/gallery/BgKFcI9


It's quite possibly the worst web page presentation I've come across in a long time - aside from the fact it looks like some bug has caused my OS to leave a random trail of mouse pointers all over the screen, some of them even move around, making me doubt my sanity when I'm quite sure I'm holding the mouse still. And the less said about the colours the better. There's no way I was going to put up with that long enough to read all the text on it.


Good times. If you click on the sun switch the entire UI gets zeroed out and you get to use on:hover mouse shtick to read the UI through a fuzzy radius. Is Yoko Ono designing websites now?


It's a joke. It made me laugh.


It's a bad joke. It made me close the browser tab.


> It's a bad joke.

To each their own.

> It made me close the browser tab.

If you can't handle refreshing or merely clicking it again, that's you having a problem, not the site having a problem.


> If you can't handle refreshing or merely clicking it again, that's you having a problem, not the site having a problem.

No, it's the site's problem. The contrast between the blinding radioactive yellow background and the font is eye straining and doesn't meet the WCAG standards for accessible text. And the dark mode is unusable. The joke would be funnier if there was a real dark mode or if the light mode was readable.


You said the joke made you leave. The normal color scheme is not part of the joke, and I'm sorry it hurts your eyes. I won't try to defend the eye hurting.

> doesn't meet the WCAG standards for accessible text.

Oh, which part of the standards? When I punch #000000 on #FDDB29 into contrast checkers I get good results.

The single quote isn't as good but all the rest has those colors.
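
For reference, the check those contrast tools run can be sketched in a few lines of Python. This follows the WCAG 2.x relative-luminance and contrast-ratio definitions as I understand them. The function names are my own.

```python
def _linearize(channel: int) -> float:
    """Convert an 8-bit sRGB channel to linear light (WCAG 2.x formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a colour given as '#RRGGBB'."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(foreground: str, background: str) -> float:
    """Contrast ratio (lighter + 0.05) / (darker + 0.05), from 1 to 21."""
    lighter, darker = sorted(
        (relative_luminance(foreground), relative_luminance(background)),
        reverse=True,
    )
    return (lighter + 0.05) / (darker + 0.05)


ratio = contrast_ratio("#000000", "#FDDB29")
print(f"{ratio:.2f}:1")
```

WCAG asks for at least 4.5:1 for normal body text at level AA and 7:1 at level AAA, and black on this yellow clears both.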


I'm using the Firefox Accessibility Tools. But you're right, I mistook the accessibility warning for the quote/header text for the body text. #000 > #FDDB29 does pass unfortunately.


Boo! I enjoyed it a ton. More fun than another bland sleek web page.


It's a creative and fun website, just not nice to use.


turned off javascript as soon as I saw it. Like trying to read with twenty mosquitos in your face.


hey be nice to my mouse cursor


If you're using firefox, toggling reader view should do the trick.


I see nice crisp black text on white background because apparently server melted down


I saw that, except half the images weren't loading, and there was just one mouse pointer.


yeah....why on earth would someone want their webpage to do this, especially if they have text they'd presumably want you to read?


It's cute, and provides a hint of human connection that is otherwise absent on the web "hey, another human is reading this too!" which you probably know but something about seeing the pointer move makes it feel real.

Probably not the greatest during a hacker news hug of death, but if I read that article some other time and saw one of the moving pointers, I would think it was really cool.


Have you ever read with other people, like in school or a book club, or been somewhere that there were other people around? It's an interesting move by the author; the loneliness epidemic hasn't gone unnoticed.

eg https://www.npr.org/2023/05/02/1173418268/loneliness-connect...


Too bad the Linux and the Mac pointer look so similar. But when you give them different background colors, it becomes more obvious which platform dominates, like:

  .pointer.l {
    background-color: green;
  }


Distracted me from reading the article, I just started chasing other people around.


That's what's missing. When I click on a pointer, its owner should have the article replaced with a "GAME OVER" message.


I know which site you are talking about before even clicking the article :(


It's fun, especially for folks like me who have ADHD. But there should be a button to disable it


Yes, reading the article is impossible with erratic movement on the screen.


as someone with a visual processing disorder, this is like having a page scream at me. Repeatedly. Never do this


yup, pretty annoying


Yeah, it's extremely obnoxious.


first thing I did before reading the article, using uBO to block JS on the page


not just you, this is what my other comment is about (indirectly)


> People are not limited to a single locale. For example, I can read and write English (USA), English (UK), German, and Russian. Which locale should I set my computer to?

Ideally - the "English-World" locale is supposedly meant for us, cosmopolitans. It's included with Windows 10 and 11.

Practically, as "English-World" was not available in the past (and still wasn't available on platforms other than Windows the last time I checked), I have always set the locale to En-US even though I have never been to America. This leads to a number of annoyances though. E.g. LibreOffice always creates new documents in the Letter paper format and I have to switch it to A4 manually every time. It's even worse on Linux, where locales appear to be less easy to customize than in Windows. Windows has always offered a handy configuration dialog to granularly tweak your locale: choosing which measurement system you prefer, whether your weeks begin on Sundays or Mondays, and even defining your preferred date-time format templates fully manually.

A less-spoken about problem is Windows' system-wide setting for the default legacy codepage. I happen to use single-language legacy (non-Unicode) apps made by people from a number of very different countries. Some apps (e.g. I can remember the Intel UHD Windows driver config app) even use this setting (ignoring the system locale and system UI language) to detect your language and render their whole UI in it.

> English (USA), English (UK)

This deserves a separate discussion. I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects. To us presence of a huge number of these (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the options lists brings only annoyance, especially when one chooses some non-US one and this opens another can of worms.

By the way, I wonder how string capitalization and comparison functions manage to work on the computers of people who actively use both English and Turkish (the Turkish locale distinguishes between dotted and dotless I: İ/i vs. I/ı).
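
To show what goes wrong, here is a Python sketch. Python's `str.upper()` and `str.lower()` are locale-independent, so Turkish text needs special handling. `turkish_upper` and `turkish_lower` are hypothetical helpers and cover only the two I pairs.

```python
DOTTED_CAPITAL_I = "\u0130"   # İ
DOTLESS_SMALL_I = "\u0131"    # ı

# Default case mapping follows the English pairing i <-> I.
assert "i".upper() == "I"
assert "I".lower() == "i"
assert DOTLESS_SMALL_I.upper() == "I"
# Lowercasing İ yields "i" plus U+0307 COMBINING DOT ABOVE, two code points.
assert DOTTED_CAPITAL_I.lower() == "i\u0307"


def turkish_upper(text: str) -> str:
    """Uppercase with the Turkish pairing i -> İ (ı -> I is already the default)."""
    return text.replace("i", DOTTED_CAPITAL_I).upper()


def turkish_lower(text: str) -> str:
    """Lowercase with the Turkish pairings İ -> i and I -> ı."""
    return text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I).lower()


print(turkish_upper("istanbul"))
print(turkish_lower("I\u011eDIR"))
```

The catch is that the program has to know which language a given string is in. A mixed English/Turkish document has no single correct answer for `"I".lower()`.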


As an Irish person, while we have en_IE which is great (and solves most of the problems you list re: Euro-centric defaults + English), I'd still quite like to have an even more broad / trans-language / "cosmopolitan" locale to use.

I mainly type in English but occasionally other languages - I use a combination of Mac & Linux - macOS has an (off-by-default but enable-able) lang-changer icon in the tray that is handy enough, but still annoying to have to toggle. Linux is much worse.

Mac also has quite a nice long-press-to-select-special character that at least makes for accessible (if not efficient) typing in multiple languages while using an English locale. Mobile keyboards pioneered this (& Android's current one even does simultaneous multi-lang autocomplete, though it severely hurts accuracy).

---

> I doubt many English speakers care to distinguish between English dialects.

I think you'll find the opposite to be true. US English spellings & conventions are quite a departure from other dialects, so typing fluidly & naturally in any non-US dialect is going to net you a world of autocorrect pain in en_US. To the extent it renders many potentially essential spelling & grammar checkers completely unusable.


I write in multiple languages daily on Linux, including English, Russian, and Chinese. Switching keyboards (at least with gnome) is a simple super-space.

While in my default (English) layout, it is easy enough to add accents and other characters using the compose key (right alt). So right-alt+'+a = á or right-alt+"+u = ü. I much prefer this over the long press as I can do it quickly and seamlessly without having to wait on feedback. Granted, it is not as discoverable, but once you are comfortable, it is in my opinion a better system.


> So right-alt+'+a = á or right-alt+"+u = ü.

Not for me!

Right-alt+'+a = â

Right-alt+'+u = û


I can 2nd this as an American who now resides in Europe. My first laptop I brought with me, and was defaulted to en_US, but my replacement is en_GB (Apple doesn't have en_NL, for good reason).

I don't find it "unusable", though. I could change it back to en_US, but it has actually been interesting to see all of my American spellings flagged by autocorrect. Each time I write authorize instead of authorise it is an act of stubborn group affinity!


> US English spellings & conventions are quite a departure from other dialects.

As far as the written, formal language is concerned, English really has only three dialects: US American, Canadian, and everywhere else. There are some other subtle differences (such as "robots" for traffic lights in South Africa, or "minerals" for fizzy drinks in Ireland¹), but that's pretty much it.

¹ Yes, this isn't just slang in Ireland: the formal, pre-recorded announcements on trains use it: "A trolley service will operate to your seat, serving tea, coffee, minerals and snacks." The corresponding Irish announcement renders it mianraí. Food service on trains stopped during covid and has not yet resumed, so I'm working from distant memory now.


> As far as the written, formal language is concerned, English really has only three dialects

This is true, but I don't see why the "formal" qualifier is needed here :) There are many more than 3 dialects of English, both written & spoken.

Especially there's a fair few extremely common notable differences in (casual, written) Irish English: the word "amn't" (among other less common contractions), the alternative present tense of the verb "to be" (i.e. "do be"), various regional plurals of "you", and - perhaps the most common - prepositional pronouns, etc. etc.


Well, quite. If we include any one or more of the following three categories (formal spoken language, informal spoken language, informal written language), then there are definitely far more than three dialects of English. But formal written language really has only the three.


I guess it's a question as to how many varieties of spelling you want to make available as "translations" in software (e.g. color vs colour, tire vs tyre).

There's plenty of regional variants just within the US, but "en_us" covers the whole country.


That's a fair point - even in tiny tiny Ireland there's many regional dialects, with larger countries there'll typically be far more.

I guess the simple answer to that is: how much interest is there in maintenance. I don't think there's any compelling reason not to create something: if there's insufficient interest in maintenance that's an imperfect but reasonable proxy for utility.

I'm not aware of any maintained en_US_Xyz languages but it might be pretty cool if someone started. There's precedent in a few other languages, like no_NO_NY, zh_Hans_HK, etc.


For both Norwegian and Chinese there are two separate written standards, along with the far more varied spoken dialects/languages.


> I doubt many English speakers care to distinguish between English dialects

It's worthwhile purely for the sake of autocorrect/typo highlighting in text-editing software. I don't miss the days of spelling a word correctly in my version of English but still being stuck with the visual noise of red highlighting up and down the document because it doesn't conform to US English.


Yeah I'd rather not have my British English dialect seen as second-class in a world of American English ideally which is what having a red document full of 'errors' implies in those sorts of situations.

It's sometimes not a trivial distinction either, for example I've heard of cases where surprised British redditors have found themselves banned from American subreddits for being homophobic when they were actually talking innocently enough about cigarettes!


I would think a lot of mods, who are either Highly Online Americans or their weirdo equivalents in other countries, are well aware of the UK usage, but simply expect Brits to give it up in order to avoid offending Americans and the global Reddit community that largely takes American-style sensitivity as its orthodoxy. And considering that Reddit corporate feels that anything that could stir up such outrage is bad for business, mods of popular subreddits may well feel pressured to come down hard on these matters.


It doesn't matter if you use UK or US spelling: you are wrong. I wish we would adopt the International Phonetic Alphabet; then I might have a chance of spelling things correctly.


> I doubt many English speakers care to distinguish between English dialects

I think you'd be surprised how many English (UK) people will get pissed off when their spell-checker starts removing the "u" from colour or flavour, or how many English (US) people get pissed off when the spellchecker starts suggesting random "u"s in words.

In addition to that, locale isn't just about language. English (US) and English (UK) decide whether your dates get formatted DD-MM-YY or MM-DD-YY, whether your numbers have the thousands broken by commas or spaces, and a host of other localization considerations with a lot more significance than just the dialect of English.
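
To make the date point concrete, here is a minimal Python sketch. The `DATE_PATTERNS` table and `format_date` helper are hypothetical and hard-coded purely for illustration; a real application would take these patterns from its locale library rather than a hand-written dict.

```python
from datetime import date

# Hypothetical per-locale patterns, hard-coded for illustration only.
DATE_PATTERNS = {
    "en-GB": "%d-%m-%y",  # day first
    "en-US": "%m-%d-%y",  # month first
}

def format_date(d: date, locale_tag: str) -> str:
    """Format a date using the pattern registered for the given locale tag."""
    return d.strftime(DATE_PATTERNS[locale_tag])

d = date(2023, 3, 4)
print(format_date(d, "en-GB"))  # 04-03-23
print(format_date(d, "en-US"))  # 03-04-23
```

The same calendar day produces two strings that each look like a valid date in the other convention, which is why the choice cannot be guessed from the text alone.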


I'd really like an en-GB-oxendict (British English but favouring -ize over -ise) locale for formal writing.


I worked for BP for a while (well, as a contracted coder) and I got quite used to the UK spell check correcting everything to its idiom. Everything seemed wrong once I returned to a world that dismissed the value of the letter 'U' and preferred the letter 'Z' over 'S'. Also missed the normalizing of drinking beer at lunch.


> Also missed the normalizing of drinking beer at lunch.

Perhaps you're an old-timer? I worked in the city in the early 80s; lunch in the pub was routine, and sometimes required. By the end of the 80s, that was at best frowned on. Over the last 20 years, having alcohol on your breath after lunch would have been a disciplinary issue, unless you were entertaining a client, at least in the places I worked.


It's likely because I was in the exploration frontier of Alaska. This was late nineties, probably an operation run by people who worked in the city in the '80's and who could continue the old ways far away from the social glare of the head office. :D


> I have always been setting the locale to En-US even though I have never been to America. This leads to a number of annoyances though. E.g. LibreOffice always creates new documents for the Letter paper format and I have to switch it to A4 manually every time

> I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects. To us presence of a huge number of these (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the options lists brings only annoyance, especially when one chooses some non-US one and this opens another can of worms.

Well, you just explained what this plethora of options is about. It's not just about how you spell flavor/flavour. It's a lot of different defaults for how you expect your OS to present information to you. Default paper size, but also how to write date and time, does the week start on Monday, Sunday, or something else, etc.


> Practically, as "English-World" was not available in the past (and still wasn't available on platforms other than Windows the last time I checked), I have always been setting the locale to En-US even though I have never been to America. This leads to a number of annoyances though. E.g. LibreOffice always creates new documents for the Letter paper format and I have to switch it to A4 manually every time. It's even worse on Linux where locales appear to be less easy to customize than in Windows. Windows always offered a handy configuration dialog to granularly tweak your locale choosing what measures system you prefer, whether your weeks begin on sundays or mondays and even define your preferred date-time format templates fully manually.

There's the English (Denmark) locale for that on some platforms.


Thank you very much, I'll give it a try.


It's a bit of a joke that doesn't have universal support. Works on my phone. Apparently you can also try en_IE (Ireland).

https://unix.stackexchange.com/questions/62316/why-is-there-...


en-GB is also a good choice


Not really, no € and no metric units


I write daily in US English, Australian English, and Austrian German. Most of the time, a specific document is in one dialect/language or another: not mixed, although sometimes that's not true.

I can understand that the conflation of spelling, word choices, time and date formatting, default paper sizes, measurement units, etc, etc, is convenient, and works a lot of the time, but it really doesn't work for me at all.

That said, I appreciate that I occupy a very small niche.


> I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects.

Most people in the UK care - a population nearly twice that of California, and larger than the number of native speakers of any non-top-20 language. If you care enough to support e.g. Italian you should support en_GB.


> English (USA), English (UK)

> This deserves a separate discussion. I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects.

While that is generally (though not always) true, I would assume it's really a stand in for the much more relevant zh locales.

It is also rather relevant to es locales (American Spanish has diverged quite a bit from European Spanish, hence the creation of es-419), definitely French (Canadian French, to a lesser extent Belgian and Swiss), and German (because of Swiss German). And it might be relevant for ko if North Korea ever stops being what it is.


Unix-style locales (set by env vars) are flexible and can be set per app. Android and iOS now support per-app locales too, IIRC. Windows locale settings are global, and some require a reboot.


As much as I appreciate that I always wondered how many programs actually respect all those tweaks.


i İ

ı I

I sympathize with people who get this wrong. (I just saw a YouTube video with the title TÜRKIYE in a segment)

Even google keyboard can't seem to distinguish between I and İ. When I type "It", it suggests "İt's" which is quite pathetic.
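
For anyone who hasn't run into this, a small Python sketch of the problem. Python's `str.upper()` and `str.lower()` apply the locale-independent default case mappings, so the Turkish pairs i/İ and ı/I have to be handled separately. The `turkish_upper` and `turkish_lower` helpers are hypothetical names of mine, not part of any library, and they only cover these two letter pairs.

```python
def turkish_upper(text: str) -> str:
    """Uppercase using the Turkish rule i -> İ (U+0130); hypothetical helper."""
    # Map dotted i first, then let the default mapping handle the rest
    # (including dotless ı -> I).
    return text.replace("i", "\u0130").upper()

def turkish_lower(text: str) -> str:
    """Lowercase using the Turkish rules I -> ı (U+0131) and İ -> i; hypothetical helper."""
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()

word = "t\u00fcrkiye"            # türkiye
print(word.upper())              # default mapping loses the dot: TÜRKIYE
print(turkish_upper(word))       # TÜRKİYE
print(turkish_lower("ISPARTA"))  # ısparta
```

The default mapping is what produces the dotless TÜRKIYE in titles: without knowing the text is Turkish, the software has no way to pick the dotted capital.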


> This deserves a separate discussion. I doubt many English speakers (let alone those who don't live in a particular anglophone country) care to distinguish between English dialects. To us presence of a huge number of these (don't forget en-AU, en-TT, en-ZW etc - there are more!) in the options lists brings only annoyance, especially when one chooses some non-US one and this opens another can of worms.

I definitely do. The biggest difference, as everyone else has pointed out, is the US vs UK spellings.

Realistically, though, beyond that, country is a poor indicator for everything else. I want to use DD/MM/YYYY date format in English, but DD.MM.YYYY date format in German. I want to use $1,000 in English, but 1.000 $ in German. This isn't dependent on the country I live in, this is dependent on a combination of a country and language - that could be the country I'm living in, or the country I grew up in (mostly US date format vs not), and it's either the language I'm actively typing in, or the language of the document I'm reading, or the language I'm thinking in (but a computer can't exactly handle that).

Trying to guess the correct combination is tricky, especially if a document is in two languages (e.g., a quotation), and users are lazy and won't switch their IME unless they have to.

What this means in slightly more practical terms is that setting a single "locale" for my device doesn't make sense, but rather I should be able to choose a locale per language (or possibly spelling preferences by language and formatting options by language as separate choices). I'd then pick a language to use the device in, and it would use that language's locale, and tell apps that this language is the preferred language. If an app doesn't provide my preferred language, pull the preferred locale from settings for a language it does support, otherwise use the default set by developers. For some apps, it's a bit more complex, particularly if I'm creating content. GMail or Office would be two good examples, where the UI language might be in English, but the emails or documents are in German, or a combination of German and English.
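
The fallback described above could be sketched roughly like this in Python. Everything here is hypothetical (the `pick_locale` name, its parameters, and the example settings); no OS exposes exactly this API as far as I know.

```python
def pick_locale(preferred_language, locale_by_language, app_languages, app_default):
    """Pick a locale for an app from per-language user settings.

    locale_by_language: user's settings, language -> locale, in preference order.
    app_languages: set of languages the app is translated into.
    app_default: locale chosen by the app's developers as a last resort.
    """
    # Try the device language first, then the user's other configured languages.
    candidates = [preferred_language] + [
        lang for lang in locale_by_language if lang != preferred_language
    ]
    for lang in candidates:
        if lang in app_languages and lang in locale_by_language:
            return locale_by_language[lang]
    return app_default

settings = {"en": "en-GB", "de": "de-DE"}  # example user settings
print(pick_locale("en", settings, {"en", "fr"}, "en-US"))  # en-GB
print(pick_locale("en", settings, {"de", "fr"}, "fr-FR"))  # de-DE
print(pick_locale("en", settings, {"fr"}, "fr-FR"))        # fr-FR
```

This only covers the UI-language case; content creation apps with mixed-language documents would need a locale per document or per span, which this sketch does not attempt.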

Even then, I'm sure there are people who need something even more flexible than that.

At the moment, if I set my language to English but my Country to Germany on my iPhone, for example, things occasionally get confused. My UK banking app, for example, pulled the decimal separator from my locale settings for a while and then refused to work because "£9,79" (or whatever it was) isn't a valid amount of money, and I couldn't see a way to fix that without switching my Country in the phone settings. I imagine they fixed it by ignoring my configured locale and always using en-GB, thus defeating the whole point of a locale in the first place.

So yeah, these days it's fairly common to not have a single "locale" that you work in - it's quite possible to want to use two or more but nothing is really set up to handle that well.


> many Chinese, Japanese, and Korean logograms that are written very differently get assigned the same code point

This leads to absolutely horrendous rendering of Chinese filenames in Windows if the system locale isn’t Chinese. The characters seem to be rendered in some variant of MS Gothic and it’s very obviously a mix of Chinese and Japanese glyphs (of somewhat different sizes and/or stroke widths IIRC). I think the Chinese locale avoids the issue by using Microsoft YaHei UI.


What on EARTH is that mouse cursor thing all about? Why would you even bother writing this, then making it impossible to read properly?


It's tracking every visitor's cursor and sharing it with every other visitor.

Why would a frontend developer demonstrate their ability to do frontend programming on their personal, not altogether super-serious blog? I meant that rhetorically but it's a flex. I agree, not the best design in the world if you're catering for particular needs, but simple and fun enough. You should check out dark mode.

In that vein, I think it's okay if we let people have fun. That might not work for everyone, but why should we let perfect be the worst enemy of fun?


> Why would

because it shows that they don't understand important design aspects

while it doesn't really show off their technical skills, because it could be some plugin or copy-pasted code; only someone who looks at the code would know better. But if someone cares enough about you to look at your code, you don't need to show off that skill on your normal website and can have some separate tech demo.

> okay if we let people have fun

yes people having fun is always fine especially if you don't care if anyone ever reads your blog or looks at it for whatever reason (e.g. hiring)

but the moment you want people to look at it for whatever reason then there is tension

i.e. people don't get hired to have fun

and if you want others to read your blog you probably shouldn't assault them with constant distractions


> people don't get hired to have fun

Living by that motto is hugely self-destructive.

Creative expression allows us to push ourselves, both in what we think we can do, and often the technical aspects about how we do it too. Even if the idea doesn't stick, you've tried something new.

In a world of Tailwinds and Bootstraps and the same five templates copied again and again and again, let's celebrate the people willing to push things and learn from their inevitable but ultimately valuable mistakes. And let's have some fun along the way.


Not every website, even technical ones, need to have an eye towards professional advancement. Sometimes they're just for fun. I welcome it, as it's a thing that gets more rare on the web as time goes by.


Considering the dark mode is effectively flashlight mode, I think it's reasonable to assume the blog's owner just likes to have a bit of fun.


I assume the creator didn't anticipate this amount of readers at the same time and having one or two other cursors on the page does sound fun and not too distracting. They should probably limit the maximum amount of other cursors displayed to a sensible amount


Lighten up.


I stopped in the middle of reading the post just for this. It was so distracting I was unable to focus on the text. It's a fun gimmick, but the result is that someone who wanted to read the post, stopped in the middle.


(sarcasm)

It's revenge against anyone with certain kinds of visual impairments and/or concentration issues, because the author's ex-spouse, who turned out to be a terrible person, had them.

(sarcasm try 2)

It's revenge against anyone using JS on the net, with the author trying to subtly hint that JS is bad.

(realistic)

It's probably one of:

- the website is a static view of some collaborative tool which has that functionality built in by default

- some form of well-intended but not well-working functionality added to the site as some form of school/study project, in which case I'm worried about the author suffering unexpected, much higher costs due to it ending up on HN ...


Hi, author here. In case you really want to know: no, it’s custom-made and works exactly as intended. There are two main reasons:

1. Fun. Modern internet is boring, most blog posts are just black text on white background. Hard to remember where you read what. And you can’t really have fun without breaking some expectations.

2. Sense of community. Internet is a lonely place, and I don’t necessarily like that. I like the feeling of “right now, someone else reading the same thing as I do”. It’s human presence transferred over the network.

I understand not everybody might like it. Some people just like when things are “normal” and everything is the same. Others might not like feeling of human presence. For those, I’m not hiding my content, reader mode is one click away, I make sure it works very well.

As for “unexpectedly ended up on HN”, it’s not at all unexpected. Practically every one of my general topic articles ends up here. It’s so predictable I rely on HN to be my comment section.


I like your content but I do think you need to rethink #1. Fun is useless if no one wants to show up because they are annoyed.


Count me too to the group of "I was so distracted that I stopped reading."

Then the second thought was: I should again start to block js by default as much as I can.


2. I only understood that it was actual other people's mouse cursors when I read that here. So it didn't really engender a sense of community, although after some time I did think you are very good at modelling actual human mouse movements. Now that I know it, it's pretty neat though.


Same. I didn't realize it was other people until I came here. Then... I went back to the page and had fun trying to follow other people's cursors.


The author has several other writeups:

https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...

The cursors will only be a problem during front page HN traffic. And the opt-out for people who care is reader mode / disable js / static mirror. Not sure if there's any better way to appease the fun-havers and the plain content preferrers at the same time. Maybe a "hide cursors" button on screen? I, for one, had a delightful moment poking other cursors.


I don't know what you people are talking about. I'm just glad I always browse with Javascript turned off. If you didn't see the writing on the wall and permanently turn Javascript off around 2006, you have no right to complain about anything.

Meanwhile, ironic irony is ironic: "Hey, idiots! Learn to use Unicode already! Usability and stuff! Oh, btw, here is some extremely annoying Javascript pollution on your screen because we are all still children, right? Har har! Pranks are so kewl!!!1!"


Are you alright?


I got a good laugh out of it.


> The only modern language that gets it right is Swift:

I disagree.

What the "right" thing is depends on the use case.

For UI it's glyph-based, kinda; more precisely, it's some good-enough abstraction over render width. Glyphs are not always good enough for that, but they are also the best you can get without adding a ton of complexity.

But for pretty much every other use-case you want storage byte size.

I mean in the UI you care about the length of a string because there is limited width to render the string.

But everywhere else you care about it because of (memory) resource limitations and costs in various ways, whether that is bandwidth cost, storage cost, number of network packets, efficient indexability, etc. In rare cases it's about being able to type it, but then it's often US-ASCII only, too.


That is why I like the way Raku handles it.

It has distinct .chars, .codes and .bytes that you can specify depending on the use case. And if you try to use .length it complains, asking you to use one of the other options to clarify your intent.

  my \emoji = "\c[FACE PALM]\c[EMOJI MODIFIER FITZPATRICK TYPE-3]\c[ZERO WIDTH JOINER]\c[MALE SIGN]\c[VARIATION SELECTOR-16]";
  say emoji; #Will print the character
  say emoji.chars; # 1 because one character
  say emoji.codes; # 5 because five code points
  say emoji.encode('UTF8').bytes; # 17 because encoded utf8
  say emoji.encode('UTF16').bytes; # 14 because encoded utf16
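
For comparison, a rough Python equivalent of the same measurements. Python's `len()` on a `str` counts code points; the byte counts come from encoding explicitly. Note that the standard library has no grapheme-cluster count, so the Raku `.chars` line has no stdlib counterpart here (a third-party segmentation library would be needed for that).

```python
# Same sequence as the Raku example: face palm, skin tone modifier,
# zero width joiner, male sign, variation selector-16.
emoji = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

code_points = len(emoji)                      # counts Unicode code points
utf8_bytes = len(emoji.encode("utf-8"))
# "utf-16-le" avoids the 2-byte BOM that plain "utf-16" would prepend.
utf16_bytes = len(emoji.encode("utf-16-le"))

print(code_points, utf8_bytes, utf16_bytes)   # 5 17 14
```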


*nod*

Rust was given as one of the examples and Rust's .len() behaviour is chosen based on three very reasonable concerns:

1. They want the String type to be available to embedded use-cases, where it's not reasonable to require the embedding of the quite large unicode tables needed to identify grapheme boundaries. (String is defined in the `alloc` module, which you can use in addition to `core` if your target has a heap allocator. It's just re-exported via `std`.)

2. They have a policy of not baking stuff that is defined by politics/fiat (eg. unicode codepoint assignments) into stuff that requires a compiler update to change. (Which is also why the standard library has no timezone handling.)

3. People need a convenient way to know how much memory/disk space to allocate to store a string verbatim. (Rust's `String` is just a newtype wrapper around `Vec<u8>` with restricted construction and added helper functions.)

That's why .len() counts bytes in Rust.

Just like with timezone definitions, Rust has a de facto standard place to find a grapheme-wise iterator... the unicode-segmentation crate.


Swift made an effort to handle grapheme clusters but severely over-complicated strings by exposing performance details to users. Look at the complex SO answers to what should be simple questions, like finding a substring: https://news.ycombinator.com/item?id=32325511 , many of which changed several times between Swift versions

I was working on an app in Swift that needed full emoji support once. Team ended up writing our own string lib that stores things as an array of single-character Swift strings.


> many of which changed several times between Swift versions

This was true while Swift was developing but it's been stable now for several years. At some point that complaint is no longer valid.


You still see all the answers from old versions sitting around, often at the top. Part of it is because of how often they changed such fundamental things. String length changed 3 times. Every other language figured these things out before the initial non-beta release.


The last time the string API changed was in 2017. That was 6 years ago.


Also, realized "needed full emoji support" sounds silly. It needed to do a lot of string manipulation, with extended grapheme clusters in mind, mainly for the purpose of emojis.


Arguably, you don’t need any (default) length at all, just different views or iterators. When designing a string type today, I wouldn’t add any single distinguished length method.


Swift string type has got many different views, like UTF-8, UTF-16, Unicode Scalar, etc… so if you want to count the bytes or cut over a specific byte you still can.


that's not the issue

defaults matter

as in they should be things you can just use by default without thinking about it

as Swift is deeply rooted in UI design, having a default of glyphs makes sense

and as Rust is deeply rooted in Unix server and systems programming, UTF-8 bytes make a lot of sense

though the moment your language becomes more general purpose you could argue having a default in any way is wrong and it should have multiple more explicit methods.


> as in they should be things you can just use by default without thinking about it

That time has passed. If you want to know the length of a string, you really should indicate what length type you mean.


There was no string.length in Swift for a while. Then they added one that just does what the user expects, get the number of grapheme clusters. If a user figures out that this isn't what they want, they can go use the other length method.


except if it's server swift code, then it doesn't do what the user expects at all


The most common reason I can think of a server wanting the string length is because it's enforcing a character limit in some field, in which case it does what the user expects. You aren't often manually managing memory in Swift. On top of that, Swift on server is probably rare to begin with.


> For example, é (a single grapheme) is encoded in Unicode as e (U+0065 Latin Small Letter E) + ´ (U+0301 Combining Acute Accent). Two code points!

It's a poor and misleading example for it is definitely not how 'é' is encoded in 99.999% of all the text written in, say, french out there (french is the language where 'é' is the most common).

'é' is U+00E9, one codepoint, definitely not two.

Now you could say: but it is also the two codepoints one. But that's precisely what makes Unicode the complete, total and utter clusterfuck that it is.

And hence even an article explaining what every programmer should know about Unicode cannot even get the most basic example right. Which is honestly quite ironic.
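
Both spellings are valid Unicode, and normalization converts between them. A minimal Python sketch using the standard `unicodedata` module (the `same_text` helper is a hypothetical name, and comparing under NFC is just one reasonable choice):

```python
import unicodedata

precomposed = "\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # U+0065 followed by U+0301 COMBINING ACUTE ACCENT

# The two strings render the same but differ as code point sequences.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(precomposed, decomposed))                       # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```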


> Unicode the complete, total and utter clusterfuck that it is.

Yikes, does it really deserve that much derision? They’re trying to standardize all written human language here. I think they’ve done a fantastic job. Pre-Unicode you had to worry about what code page a document had, and computers from different countries couldn’t interoperate. The work the consortium does is hugely important, and every decision has extremely complex tradeoffs. Composed characters makes a lot of sense, and there’s a lot of case to be made that it was the right call. The attitude of “this one thing I don’t like makes the whole thing a complete clusterfuck” is something I wish fewer engineers would have.


> Yikes, does it really deserve that much derision?

To my simple mind it had one job: allocate every grapheme a number (code point). Had it done that, the 1/2 of the article warning you about the difficulty of iterating and modifying code points would have disappeared.

But I guess it had a 2nd job: create a way of representing those numbers. The obvious way, u32, was difficult for ASCII users to swallow as it quadrupled the space used for a string. The solution we settled on, UTF-8, didn't come from Unicode (or ISO). It came from Ken Thompson (the Ken Thompson, who created B, the predecessor of C), when he tried to make something workable for C.

Unicode was the entity that ballsed up both of those tasks. It was a fork of ISO 10646. Its main contribution over 10646 was UCS-2 - ie 16 bits per character. That decision was so bad it had to be abandoned. Later they introduced the grapheme clusters rather than allocating a separate code point for each variant. I have no idea why, as it makes the programmer's task far harder. Maybe they ran out of code points. How could they possibly run out of code points, given U32 has 4 billion of them and UTF-8 could potentially have more? Because they had to kludge their way around the UCS-2 mistake to create UTF-16, and it's limited to about 1 million.
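
For reference, the "about 1 million" figure falls out of the surrogate pair arithmetic. A small Python sketch (the `to_surrogates` helper is a hypothetical name; the constants are the standard UTF-16 ones):

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a supplementary-plane code point into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need surrogates")
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F926)          # FACE PALM
print(hex(high), hex(low))                  # 0xd83e 0xdd26

# Two 10-bit halves give 2**20 supplementary code points on top of the
# 2**16 in the Basic Multilingual Plane, i.e. 17 planes in total.
total_code_points = 2**20 + 2**16
print(total_code_points)                    # 1114112
```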

Which leads us to the one thing in the article I disagree with:

> The only downside of UTF-16 is that everything else is UTF-8, so it requires conversion every time a string is read from the network or from disk.

No, that's not the only downside. There is one more: UCS-2 / UTF-16 has the endianness problem. A 16 bit value needs two bytes to represent it, and you can write two bytes to storage in two ways - little endian or big endian. They didn't specify, so the same string can have two different representations on disk. They added the infamous BOM markers to distinguish between them.
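
A short Python illustration of the two byte orders and how the BOM disambiguates them, using only the standard `codecs` module and built-in codecs:

```python
import codecs

text = "\u00e9"  # é, a single 16-bit code unit in UTF-16

little = text.encode("utf-16-le")   # low byte first
big = text.encode("utf-16-be")      # high byte first
print(little, big)                  # b'\xe9\x00' b'\x00\xe9'

# Prefixing a byte order mark lets a generic "utf-16" decoder pick the order.
with_bom_le = codecs.BOM_UTF16_LE + little
with_bom_be = codecs.BOM_UTF16_BE + big
print(with_bom_le.decode("utf-16") == text)  # True
print(with_bom_be.decode("utf-16") == text)  # True
```

Without the BOM (or out-of-band knowledge), the two byte strings are indistinguishable from valid text in the other byte order.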

I could go on, but colour me singularly unimpressed with this mob.


> It was a fork of ISO 10646.

It never was. The earlier draft of ISO/IEC 10646 bears absolutely no resemblance to the current 10646 and Unicode (for example, the first character in ISO/IEC DIS 10646:1990 was 0x20202020, which I believe is mapped to a space U+0020). Unicode had a much better design compared to 10646 so the final 10646 was retrofitted to Unicode instead.

> Its main contribution over 10646 was UCS-2 - ie 16 bits per character. That decision was so bad it had to be abandoned.

UCS-2 was already in 10646 in the draft stage. It had an even worse mechanism than surrogate pairs: escape sequences from ISO/IEC 2022 to switch groups and planes (upper 16 bits of code point). The standardized UCS-2 doesn't have them because of the merger of then-16-bit Unicode.

> Later they introduced the grapheme clusters rather than allocating a separate code point for each variant.

That sounds as if Unicode initially allocated separate code points for each variant. They didn't, or rather couldn't. An easy example is a Latin character with combining marks. There are a lot of combining marks in existence, some even defined before Unicode (yes, they're not a Unicode invention!), and sometimes a single character can have multiple marks. So Unicode only gave separate code points for compatibility, and otherwise resorted to the normalization mechanism that understands how to handle such cases.

The concept of grapheme cluster naturally arises from the existence of normalization. Not in the strictest sense, but it can be thought as a closed set over normalization and concatenation, so that it roughly matches with user-perceived characters. So grapheme clusters were already there, only the precise algorithm was specified later.

> To my simple mind it had one job: allocate every grapheme a number (code point).

You can easily have more than 10M code points in this way. The current set of Hangul syllables, precomposed or not, is 125 * 95 * 138 = 1,638,750 characters. Latin characters with at most 3 combining marks (known to exist in the wild) would be probably in the same order of magnitude. Maybe now you can try, thanks to the computing power and all the information, but in 1990? Fat chance.
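
To put numbers on the Hangul case: the precomposed block only covers modern syllables, and those are laid out arithmetically. A Python sketch under that assumption (the `compose_hangul` helper is a hypothetical name; the base and counts are the standard Hangul syllable composition constants):

```python
import unicodedata

S_BASE = 0xAC00                       # first precomposed syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # modern leading, vowel, trailing (incl. none)

def compose_hangul(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Compose a modern Hangul syllable from jamo indices (hypothetical helper)."""
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

print(L_COUNT * V_COUNT * T_COUNT)    # 11172 precomposed modern syllables
print(125 * 95 * 138)                 # 1638750, the figure quoted above

syllable = compose_hangul(18, 0, 4)   # hieuh + a + nieun
print(syllable == "\ud55c")           # True
# NFC composes the same three conjoining jamo into the same code point.
print(unicodedata.normalize("NFC", "\u1112\u1161\u11ab") == syllable)  # True
```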

> Maybe they ran out of code points. How could they possibly run out of code points, given U32 has 4 billion of them and UTF-8 could potentially have more?

For the last 20 years the rate was about 2,700 new code points per year. It would take more than 200 years to fill all other unassigned planes at this rate. And most "new" code points (in quantity) are for rare or ancient Han characters, which are technically unbounded but strongly bounded by existing ancient works and scholarly works to uncover them. I doubt there remain more than 100,000 potentially encodable Han characters.


Excellent reply.

> You can easily have more than 10M code points in this way. The current set of Hangul syllables, precomposed or not, is 125 * 95 * 138 = 1,638,750 characters. Latin characters with at most 3 combining marks (known to exist in the wild) would be probably in the same order of magnitude. Maybe now you can try, thanks to the computing power and all the information, but in 1990? Fat chance.

It can be made to work both ways. The current situation pushes the handling of composition onto the application programmer. Every time he wants to index into an array of characters, maybe to handle backspace or the user pressing arrow keys, he's forced to handle composition. But your average programmer tasked with gathering some information from the web or creating an accounting package doesn't care about this stuff, so he's going to stuff it up 10 times out of 10. That's why the sorts of problems illustrated in the original article are legion today.
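
As a concrete example of the kind of thing that gets stuffed up, here is a deliberately simplified Python sketch of backspace. It only handles trailing combining marks via `unicodedata.combining`; it is not the full grapheme cluster algorithm (no ZWJ emoji sequences, no Hangul jamo, etc.), and the `backspace` name is hypothetical.

```python
import unicodedata

def backspace(text: str) -> str:
    """Remove the last base character together with its combining marks.

    Simplified sketch: a real implementation would follow the grapheme
    cluster rules rather than just checking combining classes.
    """
    end = len(text)
    # Skip back over any trailing combining marks.
    while end > 1 and unicodedata.combining(text[end - 1]) != 0:
        end -= 1
    # Then drop the base character they were attached to.
    return text[:max(end - 1, 0)]

word = "cafe\u0301"        # "café" with a combining acute accent
print(word[:-1])           # naive slice leaves "cafe"
print(backspace(word))     # caf
```

The naive slice silently turns "café" into "cafe", which is exactly the class of bug that shows up when code treats code points as user-visible characters.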

The alternative is 10M code points as you say. But it doesn't have to be 10M real code points. Somewhere down the software stack, a piece of software could say "oh, this code point represents a composed grapheme, I'll break it down into its parts". In fact "break it down into its parts" might mean turning it into the exact representation we have now.

The difference between the two alternatives is who has to do the work. With the composition approach, the font rendering library has it slightly easier but the application writer has to do more work. In the 10M code points approach, the application writer's job is made easier, at the expense of making the font rendering library coder's job harder.

It seems pretty obvious to me which of those two approaches wins. There are literally orders of magnitude more end user applications than there are font rendering libraries out there, and what's more the font rendering library programmers are far more likely to be very concerned about doing grapheme clustering right. So if you took the 10M code point approach, you would have saved the planet a lot of code, and got a better result to boot.

As for the rest - your correction that 10646 was the source of many of the problems is appreciated. But that doesn't alter the fact that from a programmer's perspective the spec is far harder to implement at the business end than it should be. The problems started with UCS-2 and compounded from there. And as a consequence, we have a large number of font rendering bugs we could have escaped had the spec been done differently.


I appreciate your reply, which I never expected in this situation.

If my understanding is correct, your thesis is that Unicode should be hidden from application programmers as much as possible, much like GC hides memory management, so to say. Not that Unicode is bad or shouldn't exist at all, but rather that it has to be abstracted away. This is much more reasonable than most (quote-unquote) Unicode criticisms indeed. I'm not sure whether this is possible in the near future however, for reasons I'll work out here.

----

From the perspective of API consumers, most if not all programming languages have a suboptimal design for human text. In fact the type name "string" itself is inappropriate; the name comes from the assumption that a human text is a string of symbols, which is not incorrect but not helpful either. A proper "human text" type (or a collection of them) should ideally be able to do the following:

- An ability to carry additional linguistic information such as locales, grammatical genders or numbers where possible. Some can be guessed, some can be retrieved from external contexts (e.g. HTTP `Accept-Language`), some have to be retrieved with consent. Any text operation should retain them if the corresponding text is also retained.

- A language-aware formatter. For example `"Total: ${n} files"` should automatically change "files" to "file" when `n` is 1. Moreover, `"Total: ${n} ${objectName}"` should do the same if `objectName` is an English text "files". (This is why every text should retain linguistic information!) Of course the format text should be translatable (say, to "파일 총 ${n}개" in Korean) and that shouldn't change the original code.

- Proper textual isolation. If my text is composed of multiple scripts or languages, they should not affect each other in any way, and should be displayed in the best way possible. For example a missing font should not give broken boxes; either the font should be downloaded on demand, or a note about missing font should be shown instead. Inserting RTL texts into LTR texts should not flip either of texts (unless it is required by the surrounding languages). Basically, no surprises even if you don't know about them.

- Situation-aware alternatives. Even after the formatting, a long text that doesn't fit into the UI should be shortened in a way that preserves as much information as possible. For example the text "Nice to meet you, ImagineAVeryLongUserNameHere!" will be cut into "Nice to meet you, ImagineAVery..." today, but one should be able to turn this into "Hello, ImagineAVeryLongUserNam..." from the formatting layer.

None of these operations actually concern Unicode, but they are incredibly hard---if not impossible---to build. Most of them are at best fuzzily defined or often undefinable. So we are left with a number of localization and internationalization libraries which are ignored by most developers, to say the least. A mere "string" type is the norm, a program has no idea about the text and proceeds with faulty assumptions, and users are so accustomed to bad text handling that they don't even expect much. If enough users complain there is a chance of improvement, but even that is done on a case-by-case basis.

---

Unicode sits at a level much lower than what I've imagined before. It is not even a component to build the human text. It is a component to build a string, which can somehow be used to build the human text if one is very careful. Most human text operations can't be done with Unicode alone.

For example, people argue that the number of "characters" in a single Emoji sequence (say, the one mentioned in https://hsivonen.fi/string-length/) should be 1 and other answers don't make sense. This is meaningless because it will appear as a number of broken boxes if emoji fonts are not installed anyway (but not five, because it contains two default-ignorable code points). It matters what the number of "characters" is used for, and that's the whole point of the linked article. And the definition of user-perceived characters does vary over locales, so you can't count them without linguistic information anyway.
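To make the counts concrete, here is a small sketch using the facepalm sequence from the linked article, spelled with escapes so it survives any font. The set of default-ignorable code points is hand-picked for this one string (Python's standard library does not expose that property), so treat it as an assumption.

```python
# FACE PALM + skin tone modifier + ZERO WIDTH JOINER + MALE SIGN + VARIATION SELECTOR-16
facepalm = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"

code_points = len(facepalm)                           # Python counts code points
utf8_bytes = len(facepalm.encode("utf-8"))            # bytes on the wire
utf16_units = len(facepalm.encode("utf-16-le")) // 2  # what JavaScript's .length reports

print(code_points, utf8_bytes, utf16_units)  # 5 17 7

# Assumption: hand-picked default-ignorable code points present in this string.
DEFAULT_IGNORABLE_HERE = {"\u200D", "\uFE0F"}
fallback_boxes = sum(1 for ch in facepalm if ch not in DEFAULT_IGNORABLE_HERE)
print(fallback_boxes)  # 3
```

None of these four numbers is 1, and each is the right answer to a different question.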

You may still argue that Unicode algorithms are designed for the human text encoded in strings. That's a very nuanced argument, because one can also argue that they are the best effort approximation of human text operations for strings, in which case they are not the human text operations themselves. For example many languages have a case conversion operation over strings, with a varying degree of Unicode conformance (ASCII-only, simple fold, full fold, locale-dependent fold, title case, ...). But the case conversion itself is not the human text operation! Even assuming bicameral scripts, some texts are never capitalized (e.g. "McDonald" frequently capitalizes to "McDONALD", not "MCDONALD"). The human text operation, here full capitalization, needs much more than the Unicode case conversion algorithm.
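A quick illustration of that gap, using Python's built-in string methods as one example of a locale-independent Unicode case conversion (other languages differ in the details):

```python
# The string operation knows nothing about naming conventions.
print("McDonald".upper())  # MCDONALD

# Full case mapping can change the length of the string.
print("straße".upper())    # STRASSE

# No locale is consulted: Turkish text would need a dotless "ı" here.
print("I".lower())         # i
```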

Given this, it is a misguided effort to make strings more aligned with the human text, because it is not possible at all. As people frequently mistake strings as human texts however, the second best thing is to get rid of any string operation. Swift almost did this but retained a default grapheme view---I think it is actually worse given the instability of (extended) grapheme clusters over time, but also understand why they had to do that. [1] The third best thing is probably to stress that a string is not a human text, in the same way that a floating point number is not a real number. And the original post, in spite of some errors, did a good enough job in this regard.

[1] There is also a precedent of Raku's NFG (which dynamically allocates a negative code point for new grapheme clusters seen), but this is more or less an optimization of the graphemes view. The current Unicode has an infinite number of distinct grapheme clusters by design.


Thanks again for the very informative reply.

> You may still argue that Unicode algorithms are designed for the human text encoded in strings. That's a very nuanced argument, because one can also argue that they are the best effort approximation of human text operations for strings, in which case they are not the human text operations themselves. For example many languages have a case conversion operation over strings,

That is ... very nuanced indeed. Case folding does look to be a difficult issue in its own right, but it's not something I do a lot and besides even 20 years ago it was deferred to a library function (str.lower() or whatever the language provides).

The issue the original article correctly says every app trips over is neither nuanced, nor uncommon. It's boring stuff like handling backspace or left arrow cursor movement - stuff programmers have to do all the time, and that is difficult to put in a library. As a consequence they get it wrong over and over again. Unicode representing a grapheme with a single code point would fix that.

As you say I'm not sure there is a simple change you could make to Unicode that renders case folding or the other problems you describe easy. To me that's a strong hint it's not the right place to address those problems.
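For what it's worth, the backspace case can be sketched in a few lines. This is a deliberate simplification that only looks at the canonical combining class; it is not the UAX #29 algorithm and will not handle ZWJ emoji sequences, Hangul jamo, and so on. The helper name `backspace` is made up for this example.

```python
import unicodedata


def backspace(text: str) -> str:
    """Delete the last base character together with any combining marks on it.

    Simplified sketch: uses only the canonical combining class.
    """
    end = len(text)
    # Skip trailing combining marks (non-zero canonical combining class).
    while end > 0 and unicodedata.combining(text[end - 1]) != 0:
        end -= 1
    # Then drop the base character they were attached to, if there is one.
    return text[: max(end - 1, 0)]


decomposed = "cafe\u0301"      # "café" with e + COMBINING ACUTE ACCENT
print(decomposed[:-1])         # cafe  (naive: only the accent is removed)
print(backspace(decomposed))   # caf
```

Whether backspace should remove the whole cluster or just the last mark is itself a design choice, which is part of what this thread is arguing about.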


The author explains normalization in its own entire section several paragraphs later (Why is "Å" !== "Å" !== "Å"?).


> definitely not how 'é' is encoded in 99.999% of all the text written in, say, french out there

Maybe how it's input by the keyboard (I haven't checked) but not how it's output on the web or other documents.

Plenty of text goes through Unicode normalization which may convert it to two codepoints.
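A minimal sketch of what that looks like with Python's standard `unicodedata` module; the three spellings of "Å" are the ones from the article's section title.

```python
import unicodedata

precomposed = "\u00E9"   # é as a single code point
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

print(precomposed == decomposed)  # False

nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", decomposed)
print(len(nfd), len(nfc))  # 2 1

# LATIN CAPITAL LETTER A WITH RING ABOVE, ANGSTROM SIGN, and A + COMBINING RING ABOVE
forms = ["\u00C5", "\u212B", "A\u030A"]
normalized = {unicodedata.normalize("NFC", form) for form in forms}
print(normalized == {"\u00C5"})  # True
```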


> Unicode the complete, total and utter clusterfuck that it is.

I'm inclined to agree - they took a big problem and made it into a big & complex problem.

I propose a fix - immediately deprecate all multi-codepoint graphemes and provide single codepoint alternatives. That we should need to normalise something as basic as our text leaves soooo much room for problems. And the idea that some graphemes magically encode colour (emojis) ... :rolling_eyes_emoji:


Unicode being an utter cluster fuck is my takeaway from this as well. Are there any alternatives to Unicode? Maybe something which has a single grapheme code point and uses utf32 for the encoding. Or anything else saner than what we have today?


Next time read the whole article before accusing the author of incompetence!

However, the author could have added a small note, e.g. "(Unicode normalization will be covered in a later section.)", to prevent knowledgeable readers from rage quitting :)


I tried to read the article since it seemed interesting. After exactly 30 seconds of trying I had to leave the page. Impossible to read more than two sentences with all those pointers moving around - and for folks with ADHD even more difficult. Sorry, but I couldn't make it :(


Fortunately you didn't try the dark theme.


Use the reader mode. Or if you are under GNU/Linux, use Links/Lynx.


Not everyone runs Linux, and not every browser has a reader mode. This should not be the solution. There should definitely be an option to disable all these features, especially the dark mode toggle, that one's a fun premise, but horrific for usability.


True; but Links/Lynx exists for Windows, too. Or Netsurf. At least there are alternatives to choose. But you are right, the web sucks.


A real question is why IBM, Apple, and Microsoft poured millions into developing the unicode standard instead of treating character encoding like file formats as a venue for competition.

IBM and Apple in the early 1990's combined in Taligent to try to beat MS NT, but failed. But a lot of internationalization came out of that and was made open, at the perfect time for Java to adopt it.

Interestingly it wasn't just CJK but Thai language variants that drove much of the flexibility in early unicode, largely because some early developers took a fancy to it.

When you look at the actual variety in written languages, Unicode grapheme/code-point/byte seems rather elegant.

We're in the early days of term vectors, small floats, and differentiable numerics (not to mention big integers). Are lessons from the history of unicode relevant?


You can ask why they didn’t do the same for networking and serial protocols too.


Another Unicode article that mentions Swift, but not Raku :(

Raku's Str type has a `.chars` method that counts graphemes. It has a separate `.codes` method to count codepoints. It also can do O(1) string indexing at the grapheme level.

That Zalgo "word" example is counted as 4 chars, and the different comparisons of "Å" are all True in Raku.

You can argue about the merits of its approach (indeed several commenters here disagree that graphemes are the "one true way" to count characters), but it feels lacking to not at least _mention_ Raku when talking about how different programming languages handle Unicode.


> The problem is, you don’t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called “extended grapheme clusters”, or graphemes for short.

It's best to avoid making overly-general claims like this. There are plenty of situations that warrant operating on code points, and it's likely that software trying and failing to make sense of grapheme clusters will result in a worse screwup. Codepoints are probably the best default. For example, it probably makes the most sense for programming languages to define strings as arrays of code points, and not characters or 16-bit chunks or an encoding, or whatever.


> There are plenty of situations that warrant operating on code points

Absolutely correct. All algorithms defined by the Unicode Standard and its technical reports operate on the code point. All 90+ character properties defined by the standard are queried for with the code point. The article omits this information and ironically links to the grapheme cluster break rules which operate on code points.


The article doesn't say not to use code points, it says you should not be iterating on them.

Very rarely will you be implementing those algorithms. And if you're looking at character properties, the article says you should be looking at multiple together, which is correct.


> And if you're looking at character properties, the article says you should be looking at multiple together, which is correct.

I don't see where the article mentions Unicode character properties [1]. These properties are assigned to individual characters, not groups of characters or grapheme clusters.

> Very rarely will you be implementing those algorithms.

True, but character properties are frequently used, i.e. every time you parse text and call a character classification function like "isDigit" or "isControl" provided by your standard library you are in fact querying a Unicode character property.

[1] https://unicode.org/reports/tr44/#Properties
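As a small illustration of per-code-point properties, here is how a few of them look through Python's `unicodedata` module (other standard libraries expose similar lookups under different names):

```python
import unicodedata

# Each lookup takes exactly one code point, never a cluster.
samples = ["A", "\u0663", "\u0301", "\u200D"]
for ch in samples:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.category(ch),   # General_Category
        unicodedata.combining(ch),  # Canonical_Combining_Class
    )

# Classification helpers such as isdigit() sit on top of the same data.
arabic_indic_three = "\u0663"
print(arabic_indic_three.isdigit(), unicodedata.digit(arabic_indic_three))  # True 3
```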


> These properties are assigned to individual characters, not groups of characters or grapheme clusters.

But you need to deal with the whole cluster. You can't just look at the properties on a single combining character and know what to do with it.

If the article's saying to iterate one cluster at a time, then if you're doing properties a direct consequence is that you should be looking at the properties of specific code points per cluster or all of them.


The Unicode Standard does not specify how character properties should be extracted from a grapheme cluster. Programming languages that define "character" to mean grapheme cluster (like Swift) need to establish their own ad-hoc rules.

As others have pointed out in this thread, the article is full of the author's own personal opinions. The author suggests iterating text as grapheme clusters, but fails to consider that this breaks tokenizers, e.g. a tokenizer for a comma-separated list [1] won't see the comma as "just a comma" if the value after it begins with a combining character.

[1] https://en.wikipedia.org/wiki/Comma-separated_values


If some tokenizer of a comma-separated list treats the comma (I'm assuming any 0x2C byte) as "just a comma" even if the value after it begins with a combining character, that's a broken, buggy tokenizer, and one that can potentially be exploited by providing some specifically crafted unicode data in a single field that then causes the tokenizer to misinterpret field boundaries. If you combine a character with something, that's not the same character anymore - it's not equal to that, it's not that separator anymore, and you can't tell that unless/until you look at the following codepoints. If the combined character isn't valid, then either the message should be discarded as invalid or the character replaced with U+FFFD, the Replacement Character, but it should definitely not be interpreted as "just a comma" simply because one part of some broken character matches the ASCII code for a comma.

If anything, your example is an illustration why it's dangerous to iterate over codepoints and not graphemes. Unless you're explicitly transforming encodings to/from unicode, anything that processes text (not the encoding, but actual text content, like tokenizers do) - should work with graphemes as the basic atomic indivisible unit.
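The disputed input is easy to construct. The sketch below shows what a plain code point split does with it, plus one possible check a stricter tokenizer could apply; the helper `starts_with_combining_mark` is made up for this example and only looks at the canonical combining class.

```python
import unicodedata

# A comma immediately followed by COMBINING ACUTE ACCENT.
line = "a,\u0301b"

# Splitting on the code point U+002C treats the comma as "just a comma".
fields = line.split(",")
print(len(fields))  # 2


def starts_with_combining_mark(field: str) -> bool:
    """True if the field opens with a mark that would attach to the separator."""
    return bool(field) and unicodedata.combining(field[0]) != 0


# A stricter tokenizer could flag or reject such fields instead.
flags = [starts_with_combining_mark(field) for field in fields]
print(flags)  # [False, True]
```

Which of the two behaviours is "correct" is exactly the disagreement in this subthread; the code only shows that both are cheap to implement on top of code points.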


> The Unicode Standard does not specify how character properties should be extracted from a grapheme cluster. Programming languages that define "character" to mean grapheme cluster (like Swift) need to establish their own ad-hoc rules.

Right. Which means not just iterating by code point.

> The author suggests iterating text as grapheme clusters, but fails to consider that this breaks tokenizers, e.g. a tokenizer for a comma-separated list [1] won't see the comma as "just a comma" if the value after it begins with a combining character.

I don't think they're talking about tokenizers. It's a general purpose rule.

Also I would argue that a CSV file with non-attached combining characters doesn't qualify as "text".


Situations such as?

Sometimes editing wants to go inside clusters but that's not code-point based either.

I'd say that in a big majority of situations, code that is indexing an array with code points is either treating the indexes as opaque pointers or is doing something wrong.


This is pretty good. One thing I would add is to mention that Unicode defines algorithms for bidirectional text, collation (sorting order), line breaking and other text segmentation (words and sentences, besides grapheme clusters). The main point here is to know that there are specifications one should take into account when topics like that come up, instead of just inventing your own algorithm.


> Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers.

This is doubly wrong.

First, it conflates languages and writing systems. Malay and English use the same writing system but are different languages. American Sign Language is a language, but it has no standard or widely-adopted writing system. Hakka is a language, but Hakka speakers normally write in Modern Standard Mandarin, a different language.

Second, it's not the case that Unicode aims to encode all writing systems. For example, there are many hobbyist neographies (constructed writing systems) which will not be included in Unicode.


> Second, it's not the case that Unicode aims to encode all writing systems. For example, there are many hobbyist neographies (constructed writing systems) which will not be included in Unicode.

Doesn't the "private use space" technically satisfy this?


If you consider that it just "aims to" and makes no claim of succeeding to unify all languages, it isn't that wrong


Unicode is a total mess. In a sane system, "extended grapheme clusters" would equal "codepoints" and it wouldn't make a difference for 99% of languages. Now we ended up with grapheme clusters, normalization, decomposition, composition, Zalgo text, etc. But instead of deprecating this nonsense, Unicode doubled down with composed Emojis.


The writing systems were already like this when we got them. Unicode's "total mess" mostly just reflects that. Of course it would be convenient for you, the programmer, if the users wanted the software to do whatever was easiest for you, but obviously they want what's easiest for them, not you.


How is it easiest "for them" to have the mess instead of having the newer standard be less messy?


because the current mess means all their old stuff still works. ASCII is good so long as you only need English (or any other latin languages without the various accents), which was good enough for a long time - and ASCII was also carefully designed to make programming easier - flip one bit changes lower/uppercase for example, but there are more things it makes easy. By the time we realized we actually care about the rest of the world it was too late to make a nice system.


The real world never respected your artificial ASCII limitations, so that part never worked because people always needed more. But the original comment states that composition is a source of mess, not the ASCII having the same code point, and that's the puzzling part


Name one writing system where you really need character composition. Even if there is one, these special cases should be handled outside of Unicode.


you can't not handle devanagari, tamil (or like half the scripts across the Indian subcontinent and oceania) or hangul. even the IPA, used by linguists every day, would be particularly bad to deal with if we couldn't write things like /á̤/, and some languages already don't have the precomposed diacritics for all letters (like ǿ), so the idea of a world with only precomposed letter forms is more of an exponential explosion in the character set


> so the idea of a world with only precomposed letter forms is more of an exponential explosion in the character set

"Exponential explosion" is really putting it too strong; it's perfectly possible to just add ǿ and á̤ and a bunch of other things. The combinations aren't infinite here.

The problem with e.g. Latin script isn't necessarily that combining characters exist, but that there's two ways to represent many things. That really is just a "mess": use either one system or the other, but not both. Hangul has similar problems.

Devanagari doesn't have any pre-composed characters AFAIK, so that's fine.

That's really the "mess": it's a hodgepodge of different systems, and you can't even know which system to use a lot of the time because it's not organised ("look it up in a large database"), and even taking into account historical legacy I don't think it really needed to be like this (or is even an unfixable problem today, strictly speaking).

At least they deprecated ligatures like st and fl, although recently I did see ij being used in the wild.


> The combinations aren't infinite here.

They certainly are. Languages are a creative space driven by the human imagination. Give people enough time and they'll build new combinations for fun or for profit or for research or for trying to capture a spoken word/tone poem in just the right sort of exciting way. You may frown on "Zalgo text" [1] (and it is terrible for accessibility), but it speaks to a creative mood or three.

The growing combinatorial explosion in Unicode's emoji space isn't an accident or something unique to emoji, but a characteristic that emoji are just as much a creative language as everything else Unicode encodes. The biggest difference is that it is a living language with a lot of visible creative work happening in contemporary writing as opposed to a language some monks centuries ago decided was "good enough" and school teachers long ago locked some of the creative tools in the figurative closets to keep their curriculum simpler and their days with fewer headaches.

[1] https://en.wikipedia.org/wiki/Zalgo_text


Well, in theory it's infinite, but in reality it's not of course.

We've got 150K assigned codepoints, leaving us with 950K unassigned codepoints. There's truly massive amounts of headroom.

To be honest I think this argument is rather too abstract to be of any real use: if it's a theoretical problem that will never occur in reality then all I can say is: <shrug-emoji>.

But like I said: I'm not "against" combining marks, purely in principle it's probably better, I'm mostly against two systems co-existing. In reality it's too late to change the world to decomposed (for Latin, Cyrillic, some others) because most text already is pre-composed, so we should go full-in on pre-composed for those. With our 950k unassigned codepoints we've got space for literally thousands of years to come.

Also this is a problem that's inherent in computers: on paper you can write anything, but computers necessarily restrict that creativity. If I want to propose something like a "%" mark on top of the "e" to indicate, I don't know, something, then I can't do that regardless of whether combining characters are used, never mind entirely new characters or marks. Unicode won't add it until it sees usage, so this gives us a bit of a catch-22 with the only option being mucking about with special fonts that use private-use (hoping it won't conflict with something else).


The Unicode committees have addressed this for languages such as Latin, Cyrillic, and others and stated outright that decomposed forms should be preferred and decomposition canonical forms are generally the safest for interoperability and operations such as collation (sorting) and case folding (lowercase to uppercase transformations).

Unicode can't get rid of the many precombined characters for a huge number of backward compatibility reasons (including compatibility with ancient Mainframe encodings such as EBCDIC which existed before computer fonts had ligature support), but they've certainly done what they can to suggest the "normal" forms in this decade should "prefer" the decomposed combinations.

> If I want to propose something like a "%" mark on top of the "e" to indicate, I don't know, something, then I can't do that regardless of whether combining characters are used

This is where emoji as a living language actually shines as a living example: It's certainly possible to encode your mark today as a ZWJ sequence, say «e ZWJ %», though you might want to consider for further disambiguation/intent-marking adding a non-emoji variation selector such as Variation Selector 1 (U+FE00) to mark it as "Basic Latin"-like or "Mathematical Symbol"-like. You can probably get away with prototyping that in a font stack of your choosing using simple ligature tools (no need for private-use encodings). A ZWJ sequence like that in theory doesn't even "need" to ever be standardized in Unicode if you are okay with the visual fallback to something like "e%" in fonts following Unicode standard fallback (and maybe a lot of applications confused by the non-recommended grapheme cluster). That said, because of emoji the process for filing new proposals for "Recommended ZWJ Sequences" is among the simplest Unicode proposals you can make. It's not as much of a Catch-22 on "needs to have seen enough usage in written documents" as some of the other encoding proposals.

Of course, all of that is theory and practice is always weirder and harder than theory. Unicode encoding truly living languages like emoji is a blessing and it does enable language "creativity" that was missing for a couple of decades in Unicode processes and thinking.


> The Unicode committees have addressed this for languages such as Latin, Cyrillic, and others and stated outright that decomposed forms should be preferred

Yes, and that only makes things worse since the overwhelming majority of documents (99.something% last time I checked) uses pre-composed. Also AFAIK just about everyone just ignores that recommendation.

This is a classic "reality should adjust to the standard" type of thinking. Previous comments about that: https://news.ycombinator.com/item?id=36984331

I suppose "e ZWJ %" is a bit better than Private Use as it will appear as "e%" if you don't have font support, but the fundamental problem of "won't work unless you spend effort" remains. For a specific niche (math, language study, something else) that's okay, but for "casual" usage: not so much. "Ship font with the document" like PDF and webfonts do is an option, but also has downsides and won't work in a lot of contexts, and still requires extra effort from the author.

I'm not saying it's completely impossible, but certainly harder than it used to be, arguably much harder. I could coin a new word right here and now (although my imagination is failing me to provide a humorous example at this moment) and if people like it, it will see usage. In a 1960s HN, where we would have exchanged these things over written letters, it would have been trivial to propose an "e with % on top" too, but now we need to resort to clunky phrases like this (even for typewriters you can manually amend things, if you really wanted to).

Or let me put it this way: something like ‽ would see very little chance of being added to Unicode if it was coined today. Granted, it doesn't see that much use, but I do encounter it in the wild on occasion and some people like it (I personally don't actually, but I don't want to prevent other people from using it).

None of this is Unicode's fault by the way, or at least not directly – this is a generic limitation of computers.


> Yes, and that only makes things worse since the overwhelming majority of documents (99.something% last time I checked) uses pre-composed.

It shouldn't matter what's in the wild in documents. That's why we have normalization algorithms and normalization forms. Unicode was built for the ugly reality of backwards compatibility and that you can't control how people in the past wrote. These precomposed characters largely predate Unicode and were a problem before Unicode. Unicode won in part because it met other encodings where they were rather than where they wished they would be. It made sure that mappings from older encodings could be (mostly) one-to-one with respect to code points in the original. It didn't quite achieve that in some cases, but it did for, say, all of EBCDIC.

Unicode was never in the position to fix the past, they had to live with that.

> This is a classic "reality should adjust to the standard" type of thinking.

Not really. The Unicode standard suggests the normal/canonical forms and very well documented algorithms (including directly in source code in the Unicode committee-maintained/approved ICU libraries) to take everything seen in the wilds of reality and convert them to a normal form. It's not asking reality to adjust to the standard, it is asking developers to adjust to the algorithms for cleanly dealing with the ugly reality.

> Or let me put it this way: something like ‽ would see very little chance of being added to Unicode if it was coined today.

Posted to HN several times has been the well documented proposal process from start to finish (it succeeded) of getting common and somewhat less common power symbols encoded in Unicode. It's a committee process. It certainly takes committee time. But it isn't "impossible" to navigate and is certainly higher than "little chance" if you've got the gumption to document what you want to see encoded and push the proposal through the committee process.

Certainly the Unicode committee picked up a reputation for being hard to work with in the early oughts when the consortium was still fighting the internal battles over UCS-2 being "good enough" and had concerns about opening the "Astral Plane". Now that the astral plane is open and UTF-16 exists, the committee's attitude is considered to be much better, even if its reputation hasn't yet shifted from those bad old days.

> None of this is Unicode's fault by the way, or at least not directly – this is a generic limitation of computers.

Computers do anything we program them to do and in general people find a way regardless of the restrictions and creative limitations that get programmed. I've seen MS Paint drawn symbols embedded in Word documents because the author couldn't find the symbol they needed or it didn't quite exist. It's hard to use such creative problem solving in HN's text boxes, but that from some viewpoints is just as much a creative deficiency in HN's design. It's not an "inherent" problem to computers. When it is a problem they pay us software developers to fix it. (If we need to fix it by writing a proposal to a standards committee such as the Unicode Consortium, that is in our power and one of our rights as developers. Standards don't just bind in one-direction, they also form an agreement of cooperation in the other.)


The thing with normalization is that it's not free, and especially for embedded use cases people seem quite opposed to this. IIRC it requires about ~100K of binary size, ~20K of memory, and some non-zero number of CPU cycles. This is negligible for your desktop computer, but for embedded use cases this matters (or so I've been told).

This comes up in specifications that have a broad range of use cases; when I was involved in this my idea was to just spec things so that there's only one allowed form; you'll still need a small-ish table for this, but that's fine. But that's currently hard because for a few newer Latin-adjacent alphabets some letters cannot be represented without a combining character.

So then you have either the "accept that two things which seem visually similar are not identical" (meh) or "exclude embedded use cases" (meh).

I never really found a good way to unify these use cases. I've seen this come up a few times in various contexts over the years.

> Posted to HN several times has been the well documented proposal process from start to finish (it succeeded) of getting common and somewhat less common power symbols encoded in Unicode.

Would this work for an entirely new symbol I invent today? It's not really the Unicode people that are "difficult" here as such, they just ask for demonstrated usage, which is entirely reasonable, and that's hard to get (or: harder than it was before computers) especially for casual usage. I'm sure that if some country adopts/invents a new script today, as seems to be happening in West Africa in recent years, the Unicode people are more than amenable to working with that, but "I just like ‽" is a rather different type of thing.


> Would this work for an entirely new symbol I invent today? It's not really the Unicode people that are "difficult" here as such, they just ask for demonstrated usage, which is entirely reasonable, and that's hard to get (or: harder than it was before computers) especially for casual usage.

Sure, they want demonstrated usage as inline in the flow of text as textual elements as opposed to purely iconography or design elements (because such things are outside of Unicode's remit, modulo some old Wingdings encoded for compatibility reasons and the fine line between emoji are expressive text and also emoji are useful for iconography in many cases). But at this point (again in contrast to the UCS-2/no-Astral-plane days) the committees don't seem to care how it was mocked up (do it on a chalkboard, do it in paint, do it in LaTeX drawing commands, whatever gets the point across) or how "casual" or infrequent the usage is, so long as you can state the case for "this is a text element" (not an icon!) used in living creative language expression. There's more "provenance" requirements for dead languages and they'll want some number of academic citations, but for living languages they've come to be flexible (no hard requirements) on the number of examples they need from the wild and where those are sourced from. Showing it in old classic documents/manuals/books, for instance, helps the case greatly, but the committees today no longer seem as limited to just what can be used to demonstrate usage. "I just like it" is obviously not a rock solid proposal/defense to bring to a committee (any committee, really), but that doesn't mean that is impossible for the committee to be swayed by someone making a strong enough "I just like it" case if they demonstrate well enough why they like it and how they use it and how they think other people will use it (and how those uses aren't just iconography/decorative elements but useful in the inline context of textual language).


Hangul already has precomposed syllables in Unicode. We still have several hundred thousand unassigned codepoints to deal with diacritics.


The intent of Unicode was to have a universal solution for humans. Excluding one case, even if it's remote, would defeat this mission statement.


Thai, Arabic, Hebrew, and Devanagari are important examples, I believe.


The problem is not that you need character composition for some writing systems. It's that there are no rules that would help with everything having an unique representation internally.

Even "put the code points forming the composed character in descending numerical order" would be better than nothing. If it was there from the start.

However, the Unicode committee is too busy adding new emojis to make their standard sane.


There are rules for that, Unicode has standards (not only formal, but easily usable in most software libraries) for canonical forms that will collapse all the variations to a single representation.

But, of course, unicode can't define that the standard will cover only the canonical forms, and couldn't do that since the start, as it needed backwards compatibility with various pre-unicode encodings which had mutually incompatible principles, so it needed support for both composed and decomposed versions of the same characters.
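
For anyone who hasn't seen it in practice, here is a minimal sketch of that collapse using Python's standard `unicodedata` module. The sample strings are my own, and the variable names are only for illustration:

```python
import unicodedata

# Two encodings of the same visible text: precomposed vs. combining sequence.
composed = "caf\u00e9"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The raw strings differ in both length and content.
raw_equal = composed == decomposed

# After normalizing both sides to the same form, they compare equal.
nfc_equal = (unicodedata.normalize("NFC", composed)
             == unicodedata.normalize("NFC", decomposed))
nfd_equal = (unicodedata.normalize("NFD", composed)
             == unicodedata.normalize("NFD", decomposed))

print(raw_equal, nfc_equal, nfd_equal)  # False True True
```

Which form you pick matters less than picking one and applying it consistently on both sides of a comparison.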


> for canonical formS

There's your problem right there. Plural formS. It's not canonical if there is more than one of them.


> Unicode doubled down with composed Emojis.

Not just emojis, in general I believe Unicode has just said they're not going to add new pre-composed characters and that using combining characters is the Right Way™ to do things (well, the only way for newer scripts).

One of the downsides of writing down specifications is that they tend to attract people with Very Strong Opinions on the One And Only Right Way and will argue it to no end, and essentially "win" the argument just by sheer verbosity and persistence.

That's certainly what I've seen happen in a few cases, and is what happens on e.g. Wikipedia as well at times.

But yeah, emojis are even worse. Some things can look rather different depending on which invisible variation selector is present.[1] We've got tons and tons of unassigned codepoints and we need to resort to these tricks to save a few of them?

Firefighter is "(man|woman|person) + ZWJ + firetruck". Clever, I guess. Construction worker is "Construction worker (+ ZWJ + (male sign|female sign))?" (absence is gender-neutral). Why are there 2 systems to encode this? Sigh...

All of this is too clever by a mile.
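
To make the two schemes concrete, here is a rough sketch in Python that builds one sequence of each kind and lists the code points inside. The specific emoji chosen and the `describe` helper are my own illustration, not anything from the spec:

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Firefighter style: person + ZWJ + object
man_firefighter = "\U0001F468" + ZWJ + "\U0001F692"

# Construction worker style: base emoji + ZWJ + gender sign + variation selector
woman_construction_worker = "\U0001F477" + ZWJ + "\u2640" + "\ufe0f"

def describe(s):
    """Return the Unicode name of every code point in s."""
    return [unicodedata.name(ch) for ch in s]

for emoji in (man_firefighter, woman_construction_worker):
    print(len(emoji), describe(emoji))
```

Both render as a single glyph where supported, yet `len` reports 3 and 4 code points respectively.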

[1]: HN will strip stuff, but try something like:

  echo $'↔\ufe0f ↔\ufe0e'
May not display correctly in terminal, but can xclip it to a browser – screenshot: https://imgur.com/a/iFmBDQk


The first time I heard that Unicode would support emoji, I knew it would be a recipe for disaster. And I definitely was not disappointed.


I mean, I don't dislike the concept personally. I actually really hate how HN strips them.

But the technical implementation? Yeah, that could have gone a lot better IMHO.

One must also wonder if some things really had to be added in the first place, e.g. for people kissing it's:

  (person|man|woman)(skin-tone)? ZWJ <heart> ZWJ <kissing lips> ZWJ (person|man|woman)(skin-tone)?
This is NOT a complaint about that they added diversity as such, in principle I'm all for that, it's just that few seem to actually use these emojis, and both in terms of code and UI it all gets pretty complex; there's 98 combinations to choose from here.

I don't really get why <heart> or <kissing lips> or <kissing face> isn't enough. That's actually what most people seem to use anyway, because who finds it convenient to pick all the correct genders and skin tones from the UI for both people?


> there's 98 combinations to choose from here.

Less than that since a default skin color can be set in most apps. I'm sure setting a gender will come soon so the entire first part of that emoji can be auto-guessed. Then it's just showing the other options in the UI. Really all of this is UI design, as even with the 98 combinations you can still display it as 4/5 options you drill down into.

> who finds it convenient to pick all the correct genders and skin tones from the UI for both people?

I just checked and searching "kissing" in my iOS emoji keyboard inside Messenger showed just 4 of the emojis you're describing - defaulting both skin tones to my settings and then the four M/F pair ups. Plus some non-related kissing emojis like the cat kissing.


> defaulting both skin tones to my settings

But that's kind of wrong, no? The entire point is that you can choose both sides individually. What if you set it to black and want to kiss some white bloke?

If anything that only underscores my point that it's too complex and that no one is using them (certainly not as intended anyway).


That's on Apple not on emojis.

In the Windows 11 emoji picker it works like this:

1. Search "kissing". See two generic yellow people kissing. Notice a blue dot in the bottom right corner.

2. Clicking the emoji brings up previously used versions of the kissing emoji, with a + button.

3. Clicking + brings up a dialog like I described previously. Two generic figures at the top, then a row of skin tones.

4. You can click on each generic person and choose a gender, then select a skin tone. You can do this for each person in the group.

5. Click done. This emoji is now in your default emoji list and you won't need to recreate it again.


That seems like a lot of effort when you could have sent <kissing-lips>, <kissing-cat>, <kissing-face>, <heart>, or any number of other emojis, which is what my point was.


You still can! People who want more customizations can do so too. Plus it only takes the initial setup per emoji at least.


> I actually really hate how HN strips them.

Oh. So that's why HN discussion always looks sane. They strip the pollution.


I feel it's the same as with any long-standing computer system we have today. It was designed as more and more of the world came online, with all the growing pains that came with that. Could it be built from scratch today better? Yes. Will it? No. I suspect it will be around long after we are all dead. Same with IPv4 :V


For most software it doesn't really matter either.

I've written unicode-aware software for over a decade, doing a wide variety of programs, and I've never had to bother with all that mess.

If I'm parsing strings I'm looking for stuff in the 7-bit ASCII range which maps neatly onto the Unicode representations, and so I just need to take care to preserve the rest.

The only trouble I've had is that a lot of programmers haven't learned, or don't get, that text encoding is a thing and that it needs to be handled.

So they'll hand me an XML they claim is UTF-8 encoded, except that XML header was just copypasta and the actual XML document is encoded in some other system encoding like Windows-1252. Or worse, a mix of both.
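
What I usually end up writing looks something like the sketch below (Python; the function name `decode_claimed_utf8` is made up for this comment). It trusts the UTF-8 claim only if the bytes actually decode strictly, and otherwise assumes Windows-1252. It cannot rescue the "mix of both" case:

```python
def decode_claimed_utf8(raw: bytes) -> str:
    """Decode bytes labelled as UTF-8, falling back to Windows-1252.

    Assumption: anything that is not valid UTF-8 is Windows-1252.
    A document mixing both encodings will still come out partly mangled.
    """
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined, so replace those
        # with U+FFFD instead of raising a second time.
        return raw.decode("cp1252", errors="replace")


print(decode_claimed_utf8("café".encode("utf-8")))
print(decode_claimed_utf8(b"caf\xe9"))
```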


Honestly I like ipv4 better than v6. I like having a NAT and easy addresses like 192.168.1.3 instead of fe80::210:5aff:feaa:20a2. They didn't need to mess with those things just to expand the address space, like how utf8 didn't require remapping ASCII.


IPv4.1 should have just had 39 bits, to be written like 999.999.999.999. (I know this wouldn't have actually had much effect, nobody is going to add new routes in the middle of "class A" spaces that already existed, so it would just give those that already had IP addresses more IP addresses. Additionally, people really abuse decimal addresses in horrifying ways; for example, Fios steals 192.168.1.100-192.168.1.150 for its TV service, and that range doesn't really correspond to anything that you can mask off in binary. It only makes sense in decimal, which is not what any underlying machinery uses. They should have given themselves a /26 or something. You get 3 for yourself (modulo the broadcast and gateway address), and they get 1 for TV.)
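
If you want to see how badly that range fits binary masks, Python's standard `ipaddress` module can compute the smallest set of CIDR blocks covering it. This is only a quick sketch of the arithmetic, not a claim about how any ISP configures things:

```python
import ipaddress

first = ipaddress.IPv4Address("192.168.1.100")
last = ipaddress.IPv4Address("192.168.1.150")

# summarize_address_range yields the minimal list of networks that
# exactly covers the inclusive range first..last.
blocks = list(ipaddress.summarize_address_range(first, last))

for net in blocks:
    print(net)
print(len(blocks), "CIDR blocks needed")
```

A range that is tidy in decimal needs several separate prefixes, whereas a single /26 would be one route entry.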


Having it actually be decimal might've been nice, but at this point people are used to the 1-254 range, and I think the least jarring addition of extra bits would be to simply extend it for the addresses that need them (and not for the ones that don't). So you could have 123.444.3.254 or longer like 123.444.3.254.12.43.


Go ahead and use fe80::3 as a link-local address.

For site local, fec0::3. Yeah site-local is discouraged but you can still do it. Or you can slightly misuse fd00::3.

You only get those latter 16 hex characters if you explicitly don't want to choose addresses.


Be the change you want to see in the world. If we're going to make huge breaking changes, might as well do it sooner rather than later.


With something as large as an end user language format for input, this is a change we ourselves cannot make, just as with using another calendar for dates. Just because I want to use the year 2002023 calendar with 29.5 days per month, doesn't make it useful to others or myself really.


I think actually you could. A thought experiment:

The problems with Unicode are mostly to do with internal inconsistencies and churn, problems that usually only affect programmers.

1. Different ways to encode the same visually indistinguishable set of characters as code points leading to normal forms, text that compares unequal even when it appears to be identical, the disastrous "grapheme clusters" concept and so on.

2. Many different ways to encode the same sequence of code points as bytes. Not only UTF-32/16/8 but also curiosities like "modified UTF-8".

3. Emoji. A fractal of disasters:

3.a. Updates frequently. Neither Unicode nor software in general was built on the assumption that something as basic as the alphabet changes every year. If you send someone an emoji, can their device draw it? Who knows! In practice this means messaging apps can't rely on the OS system fonts or text handling libraries anymore which is a drastic regression in basic functionality.

3.b. (Ab)uses composition so much it's practically a small programming language, e.g. flags are composed of the two letter country code spelled using special characters. People are represented as a generic person plus skin color patch, families are represented using composed individual people etc.

3.c. Meaning of a character is theoretically specified but can subtly depend on the font used, e.g. people use a fruit emoji in visual puns because of how it looks specifically on Apple devices, so a "sentence" can make no sense if it's rendered with a different font.

3.d. Unbounded in scope. There's no reason the Unicode committee won't just keep adding new pictograms forever.

3.e. Encoded beyond the BMP which in theory every correct program should handle but in practice some don't because nobody except a few academics used characters beyond it much until emoji came along.

3.f. Disagreement over single vs double width chars, can only know this via hard-coded tables, matters for terminals and code editors.
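
To illustrate 3.b, this is roughly how the flag scheme works, sketched in Python. The `flag` helper is hypothetical; the only real fact it relies on is that the regional indicator symbols for A to Z occupy 26 consecutive code points starting at U+1F1E6:

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A

def flag(country_code: str) -> str:
    """Spell a two-letter country code with regional indicator symbols."""
    if len(country_code) != 2 or not country_code.isascii() or not country_code.isalpha():
        raise ValueError("expected a two-letter ASCII country code")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A"))
        for letter in country_code.upper()
    )

de = flag("DE")
print(len(de), [hex(ord(ch)) for ch in de])
```

Whether the pair renders as a flag or as two boxed letters is entirely up to the font, which is also a nice example of 3.a and 3.c.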

Some of these can potentially be cleaned up outside of the Unicode consortium in backwards compatible ways. You could have a programming language that automatically normalized strings to fully composed form when deserializing from bytes, and then automatically folded semantically identical code points together (this would be a small efficiency win for some languages too). You could campaign to build a consensus around a specific normal form, like how UTF-8 gained consensus as a transfer encoding. You could also define a fork of Unicode (using private use areas?) that allocates a single code point to the characters that are unnecessarily using composition today but don't yet have one and then just subset out the concept of composition entirely.

Emoji are a big problem. It's tempting to say that these should not be encoded as characters at all. Instead there could be a set of code points that define bounds that contain a tiny binary subset of SVG, enough to recreate the Apple pixel art somewhat closely. Emoji would always be transmitted as inlined vector art. Text rendering libraries would call out to a little renderer for each encoded glyph, using a fast fingerprinting algorithm to deduplicate the bytes to an internal notion of a character. To avoid wire bloat, text can simply be compressed with a pre-agreed zstd or Brotli dictionary that contains whatever images happen to be popular in the wild. At a stroke this would avoid backwards compat problems with new emoji, enabling programs working with text to be upgraded once and then never again, eliminate all the ridiculous political committee bike-shedding over what gets added, let apps go back to using system text support and get rid of the bajillion edge cases that emoji have spewed all over the infrastructure.


Do your dates alpha-convert?


What sort of practical issues are you running into due to Unicode's codepoint compositionality?


It's unnecessary complexity and a security nightmare. Have you ever tried to implement Unicode normalization? A single bug in your code and malformed text can crash your application or worse.


It's hard for me to imagine how Unicode normalization could crash your application unless you have very convoluted memory management code.

What on earth are you doing that it's leading to crashes? Are you not validating the result?


iMessage has had several vulnerabilities related to this. Whatever the difficulties are, even Apple can't handle them sometimes.


I'm very skeptical, but willing to be proven wrong. What's the CVE?


First that comes to mind is the "effective power" one, https://nvd.nist.gov/vuln/detail/cve-2015-1157 There's also the "black dot" one, can't find the CVE though.


That seems like a truncation + display issue though, not a normalization issue.

https://www.reddit.com/r/apple/comments/37e8c1/malicious_tex...

In fact, I don't know that there's any reason to believe normalization happens at all in the process of executing this.


That's tricky, for sure. My 'workaround' has long been converting codepoints into byte sequences and creating a character dictionary from that. Based on the source corpus, this dictionary can be further expanded/compressed and used for downstream processing.


Normalization and the fact it is not forward-compatible.


But precomposing all the potential combinations is less sane than the current mess (and you can outlaw Zalgo in the standard if you think it's a serious issue)

Also, the % should measure people, not languages, that would greatly decrease the imaginary 99%


They kind of have to don’t they? Otherwise we’ll become space-limited way too fast? Especially with how quickly new emojis are being made and all their variants.


Prior to this article, I knew graphemes were a thing and that proper unicode software is supposed to count those instead of bytes or code points.

I didn't know that unicode changes the definition of grapheme in backwards incompatible fashion annually, so software which works by grapheme count is probably inconsistent with other software using a different version of the standard anyway.

I'm therefore going to continue counting bytes. And comparing by memcmp. If the bytes look like unicode to some reader, fine. Opaque string as far as my software is concerned.


The point is that a byte focus will often frustrate users.

e.g., a TUI with columns will have to truncate "long" strings in each column, and that truncation and column-separator arrangement really should be grapheme aware.

e.g., a string search (for a name, let's say) should find Noël regardless of whether the user input ë via composing characters or the pre-composed version.
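
A rough sketch of the search case in Python, standard library only. The helper names are mine, and case folding followed by NFC is a simplification of the full canonical caseless matching that Unicode defines:

```python
import unicodedata

def search_key(s):
    """Fold case, then compose, so equivalent spellings share one key."""
    return unicodedata.normalize("NFC", s.casefold())

def contains(haystack, needle):
    """Substring search that ignores case and composed/decomposed differences."""
    return search_key(needle) in search_key(haystack)

stored = "Dear Noe\u0308l,"   # 'e' + U+0308 COMBINING DIAERESIS
query = "no\u00ebl"           # precomposed U+00EB, lower case

print(query in stored)          # byte-level style comparison
print(contains(stored, query))  # normalized comparison
```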


> I didn't know that unicode changes the definition of grapheme in backwards incompatible fashion annually, so software which works by grapheme count is probably inconsistent with other software using a different version of the standard anyway.

This is EXACTLY why Rust's standard library is blind to graphemes. Support for the case where your company requires a specially certified toolchain that lags five years behind Rust upstream is an explicit goal that they address by breaking the stuff that changes quickly out into minimal crates that can be audited, updated at a quicker pace without requiring toolchain updates, and which have the option to continue to support older compiler versions indefinitely.


Two Unicode strings can be visually and semantically identical, but not byte-equal.


Comparing by memcmp will result in false negatives unless you can ensure that all incoming text gets normalized to a particular canonical form.


Even then there are minefields in comparing text, especially case insensitive matching and supporting CJK.



> Before comparing strings or searching for a substring, normalize!

...and learn about the TR39 Skeleton Algorithm for Unicode Confusables. Far too few people writing spam-handling code know about that thing.

(Basically, it generates matching keys from arbitrary strings so that visually similar characters compare identical, so those Disqus/Facebook/etc. spam messages promoting things like BITCO1N pump-and-dumps or using esoteric Unicode characters to advertise work-from-home scams will be wasting their time trying to disguise their words.)

...and since it's based on a tabular plaintext definition file, you can write a simple parser and algorithm to work it in reverse and generate sample spam exploiting that approach if you want.

https://www.unicode.org/Public/security/latest/confusables.t...
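
For the curious, the shape of the algorithm is small enough to sketch in Python: decompose, map each character to its prototype, decompose again. The `PROTOTYPES` table below is a handful of entries written from memory in the style of that file, purely for illustration; a real implementation would load the full data file and should be checked against it:

```python
import unicodedata

# Illustrative excerpt only (an assumption, not the real table).
PROTOTYPES = {
    "1": "l",        # DIGIT ONE -> LATIN SMALL LETTER L
    "I": "l",        # LATIN CAPITAL LETTER I -> LATIN SMALL LETTER L
    "0": "O",        # DIGIT ZERO -> LATIN CAPITAL LETTER O
    "\u0430": "a",   # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
    "\u043e": "o",   # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
}

def skeleton(s):
    """NFD, replace each character by its prototype, then NFD again."""
    decomposed = unicodedata.normalize("NFD", s)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def confusable(a, b):
    """Two strings are treated as confusable when their skeletons match."""
    return skeleton(a) == skeleton(b)

print(confusable("BITCO1N", "BITCOIN"))
print(confusable("p\u0430yp\u0430l", "paypal"))
```

Note that the skeleton is only a matching key. It is not meant to be displayed or stored as a cleaned-up version of the text.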

> and CD-ROM!

I think you mean Microsoft Windows's Joliet extensions to ISO9660 which, by the way, use UCS-2, not UTF-16. (Try generating an ISO on Linux (eg. using K3b) with the Joliet option enabled and watch as filenames with emoji outside the Basic Multilingual Plane cause the process to fail.)

The base ISO9660 filesystem uses bytewise-encoded filenames.


But not all normalizations are done to fight spam, not all of them should be interested in visual similarity.

I normalize strings in searches not because of bad intents but because for all user related purposes "Comunicações" and "Comunicações" are the same, their different encodings being more of an accident.


*nod* ...and stemming is that taken to a greater extreme.

I was just pointing out that Unicode itself has various forms of normalization and normalization-adjacent functionality that people are far too unaware of.


Am I supposed to hate this website, cause I kinda do


The mustard background with black text is harsh on the eyes.


strange. I quite like it


Me too. I get the impression of a very saturated off-white yellow.

But any more saturation and it would go all mustard-electric on me.

That's an interesting observation on variation of saturation response. Feels like useful knowledge for ... web site designers. Or any color crafter.


FWIW - Right Click, Inspect. There's a div with an attribute, "pointers" in the body root.

Deleting that makes the whole thing a lot less stressful.


Try the night mode (top right corner)...

It's black text on black background (I'm on mobile Firefox on Android).


Night mode is an absolute delight on desktop, you're missing out


On desktop your mouse pointer is a flashlight. I wonder if it supports touch.


It’s not pleasant to read. Strange, since Tonsky is the curator of the Fira Code font, and would presumably be interested in presentation


uBlock Origin -> Disable Javascript

Problem solved!


Firefox reader mode is better still.


That breaks the video. inspect -> network -> refresh -> blocking the request for pointers.js works.


It doesn't on Firefox, you get the built in media controls.


Toggle the dark mode for a real treat.


Now that is really funny.

Future _improvement_ idea: the mouse cursors are shared, so the light switch should be, too! Let me play with the light with everyone


I wondered about how to do simple text centering / spacing justification given graphemes showing string lengths that don't match up human-perceived characters, like in 'Café' (python len('Café') returns 5, even though we see four letters).

Found this! good to know about. https://pypi.org/project/grapheme/ "A Python package for working with user perceived characters. "
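
If you can't pull in a dependency, a crude stdlib-only approximation is to normalize and skip combining marks when counting. To be clear, this sketch is my own shortcut and not UAX #29 segmentation: it will miscount emoji sequences and ignores double-width characters:

```python
import unicodedata

def approx_char_count(s):
    """Count code points that are not combining marks, after NFC."""
    nfc = unicodedata.normalize("NFC", s)
    return sum(1 for ch in nfc if unicodedata.combining(ch) == 0)

def center(s, width):
    """Center s in a field of the given width, padding with spaces."""
    pad = max(width - approx_char_count(s), 0)
    left = pad // 2
    return " " * left + s + " " * (pad - left)

word = "Cafe\u0301"  # decomposed form: 'e' + U+0301 COMBINING ACUTE ACCENT
print(len(word), approx_char_count(word))  # 5 4
print("[" + center(word, 8) + "]")
```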

(apparently the article talks about this however the blog post is largely unreadable due to dozens of animated arrow pointers jumping all over the screen)


Just had this come up at work --- needed a checkbox in Microsoft Word --- oddly the solution to entering it was to use the numeric keypad, hold down the alt key and then type out 128504 which yielded a check mark when the Arial font was selected _and_ unlike Insert Symbol and other techniques didn't change the font to Segoe UI Symbol or some other font with that symbol.

Oddly, even though the Word UI indicated it was Arial, exporting to a PDF and inspecting that revealed that Segoe UI Symbol was being used.

As I've noted in the past, "If typography was easy, Microsoft Word wouldn't be the foetid mess which it is."


That's unrelated to unicode. The checkmark symbol just isn't in the Arial font, so Word just falls back to a font that has it - Segoe UI. You've found a bug where Word still thinks it's Arial. But this is something that would happen no matter what encoding you choose for your characters.


I don't know this for a fact, but it's possible that the text run is logically considered to be Arial and the fallback could be handled as just a rendering step, rather than being encoded in the document. Doing it that way could allow the text to render on different versions of Arial, some of which do have a checkbox char, at the risk of the appearance and layout changing depending on which fonts are installed.


The weird thing is, it didn't work thus for other numbers used for this or similar characters.


Just a nitpick because the page says: "Unicode is a standard that aims to unify all human languages, both past and present, and make them work with computers." but of course unicode is only relevant to written languages as opposed to spoken languages (and signed languages)

I wish that was the only thing wrong with that page


Regarding UTF-8 encoding:

“And a couple of important consequences:

- You CAN’T determine the length of the string by counting bytes.

- You CAN’T randomly jump into the middle of the string and start reading.

- You CAN’T get a substring by cutting at arbitrary byte offsets. You might cut off part of the character.”

One of the things I had to get used to when learning the programming language Janet is that strings are just plain byte sequences, unaware of any encoding. So when I call `length` on a string of one character that is represented by 2 bytes in UTF-8 (e.g. `ä`), the function returns 2 instead of 1. Similar issues occur when trying to take a substring, as mentioned by the author.

As much as I love the approach Janet took here (it feels clean and simple and works well with their built-in PEGs), it is a bit annoying to work with outside of the ASCII range. Fortunately, there are libraries that can deal with this issue (e.g. https://github.com/andrewchambers/janet-utf8), but I wish they would support conversion to/from UTF-8 out of the box, since I generally like Janet very much.

One interesting thing I learned from the article is that the first byte can always be determined from its prefix. I always wondered how you would recognize/separate a unicode character in a Janet string since it may have 1-4 bytes length, but I guess this is the answer.
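
Since Janet strings are plain bytes, the same idea can be sketched over a Python `bytes` object. This assumes the input is already valid UTF-8 and does no validation of continuation bytes; the function names are just for illustration:

```python
def utf8_sequence_length(lead_byte):
    """Return how many bytes the sequence starting with lead_byte occupies."""
    if lead_byte & 0b1000_0000 == 0b0000_0000:   # 0xxxxxxx
        return 1
    if lead_byte & 0b1110_0000 == 0b1100_0000:   # 110xxxxx
        return 2
    if lead_byte & 0b1111_0000 == 0b1110_0000:   # 1110xxxx
        return 3
    if lead_byte & 0b1111_1000 == 0b1111_0000:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte, so we are in the middle of a sequence.
    raise ValueError("not a valid UTF-8 lead byte")

def split_code_points(data):
    """Split valid UTF-8 bytes into one bytes object per code point."""
    pieces = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        pieces.append(data[i:i + n])
        i += n
    return pieces

sample = "a\u00e4\u20ac\U0001F600".encode("utf-8")
print([len(p) for p in split_code_points(sample)])  # [1, 2, 3, 4]
```

The same prefix check is what lets you resynchronize after landing in the middle of a string: skip bytes of the form 10xxxxxx until you hit a lead byte.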


You CAN'T do any of these things in Unicode in general, in all of its encodings.

There's no random access in Unicode. It's a stateful system that requires linear scan.


Pretty clearly, "every software developer" doesn't need to understand Unicode with this level of familiarity, much like "every programmer" doesn't need to know the full contents of the 114 page Drepper paper. For example, I work on a GUID-addressed object store. Everything is in term of bytes and 128-bit UUIDs. Unicode is irrelevant to everyone on my team, and most adjacent teams. There is lots of software like this.


There seem to be quite large segments of developers working on functionality which "handles text" as immutable whole blobs, in which case one really doesn't have to know anything about unicode.

However, as soon as you want to look into the text contents of the objects of that object store and handle parts of it, even in the very simplest way (e.g. checking whether the stored object contains some character, or whether two text messages stored in that object store are the same) then you can't treat them as bytes anymore, and all the concerns listed in this article suddenly become relevant for your team.


Yes, exactly.


Glad I'm not the only one who was irked by this, and I do need to know a lot about Unicode for my job!

I believe there actually are topics that every software developer ought to know something about, but this isn't one of them. My list would be things more like, the difference between a constant-time algorithm and a quadratic-time one.


Not every programmer has to remember everything in this article, but they should remember that Unicode (or any text system in the wild) is actually complex, so they should research as needed.


Sure, that's a reasonable take.


There is another 'modern' language that does utf8 right and has done it right for a long time. I know it's mostly fallen out of favour, but we're still out here: Perl.

  $ perl -wle 'use utf8; print length("");'
  1

Without use utf8:

  $ perl -wle 'print length("");'
  3

It's funny: after Perl fell out of favour, is when it got all its best stuff. It's still my preferred language for just about everything.


> The only modern language that gets it right is Swift:

    print("...".count)
    // => 1
And Erlang/Elixir! I guess they are not "cool" enough. But they correctly interpret that as one grapheme cluster.

    % erl +pc unicode
    > string:length("...").
    1
(... here is the U+1F926 U+1F3FB U+200D U+2642 U+FE0F emoji)


The author does refer to Elixir further down:

> UPD: Erlang/Elixir seem to be doing the right thing, too.


Updated a day after it was mentioned here :-)

https://github.com/tonsky/tonsky.me/commit/5fbbb373025be3758...


Please don't refer to codepoints as characters. Some are, some are not, it isn't a useful or informative approximation, it's just wrong. Unicode is a table which assigns unique numbers to different codepoints, most of which are characters. ZWJ is not a character at all, and extended grapheme clusters made of several codepoints are.


'Character' doesn't have a single meaning. ZWJ is a character according to definitions (2) and (3) in https://unicode.org/glossary/#character


Wonderful to learn more about Unicode.

Does anyone know how to write a function (preferably in swift) to remove emoji? This is surprisingly hard (if the string can be any language, like English or Chinese).

There’s been multiple attempts on Stackoverflow but they’re all missing some of them, as Unicode is so complex.


I haven't tried, but use libicu (ICU). Split the text into graphemes and remove anything starting with a codepoint that has the Zsye script. There should be Swift bindings.


Here's a 1-liner, producing the string "text 0123 漢字":

`String("text EMOJI 0123 漢字".unicodeScalars.filter({ !$0.properties.isEmojiPresentation }))`

(I've had to substitute EMOJI for a smiley face, because HN is bad at text encoding.)


Thanks. Unfortunately both .isEmojiPresentation and .isEmoji leave many emojis out, like the red heart and many others.


Those aren't inherently emojis, the font just shows them as emojis, so you'd have to render the text.


Correct. `isEmojiPresentation` checks if, per the Unicode standard, this scalar should default to an emoji presentation.


It's not a bug, HN deliberately strips emojis.


> Among them is assigning the same code point to glyphs that are supposed to look differently, like Cyrillic Lowercase K and Bulgarian Lowercase K (both are U+043A).

This is nonsense, Bulgaria has been using the Cyrillic alphabet since its creation in … Bulgaria!

What you’ve shown is two different fonts, and both renderings are perfectly fine in Bulgaria.

Read up more about it on wikipedia: https://en.wikipedia.org/wiki/Bulgarian_alphabet


> normalization

A quick war-story on this: We had a system which was taking web-user input for human names, and then some of it had to be sanitized for a crappy third-party system. However some of the names were getting mangled in unexpected ways.

One of the (multiple) issues was that we were sometimes entirely dropping accented characters even when a good alternative existed. This occurred when we were getting "é" (U+00E9) instead of "é" (U+0065 U+0301), a regular letter E plus a special accent modifier. By forcing the second form (D normalization) we were able to strip "just the accents" and avoid excessively-wrong names.

Going further with KD normalization (NFKD, compatibility decomposition), weird stuff like "⑧" (digit 8 in a circle) becomes a regular digit 8.
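The accent-stripping step in the war story above can be sketched roughly like this in Python. This is only an illustration using the standard `unicodedata` module, not the system described; the function name `strip_accents` and the sample names are made up for the example.

```python
import unicodedata

def strip_accents(text):
    # NFD splits each precomposed letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have a nonzero canonical combining class; drop them.
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)

precomposed = "Ren\u00e9e"   # "é" as the single code point U+00E9
decomposed = "Rene\u0301e"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Both spellings end up as plain "Renee" instead of losing the letter entirely.
print(strip_accents(precomposed), strip_accents(decomposed))

# NFKD additionally replaces compatibility characters such as U+2467 (circled 8).
print(unicodedata.normalize("NFKD", "\u2467"))  # 8
```

Whether dropping the marks is acceptable depends on the language; for a legacy ASCII-only downstream system it is usually the least bad option.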


Give https://manishearth.github.io/blog/2017/01/15/breaking-our-l... a read and, ideally, the other things it mentions like https://eev.ee/blog/2015/09/12/dark-corners-of-unicode/ and https://manishearth.github.io/blog/2017/01/14/stop-ascribing....

(Among other things, it points out that doing that to non-Latin text is liable to change pronunciations and meanings in other languages. For example, some languages use diacritics for voiced/unvoiced indication where your "normalization" could do things like "tick→dick" or "did→tit".)

(Did you ever notice that? B/P, D/T, V/F, G/K, J/CH, and Z/S form voiced/unvoiced pairs that could have been indicated with a single letter and a diacritic. Same mouth behaviour. It's just a question of whether you engage your vocal cords.)


  > The only modern language that gets it right is Swift:
arguably not true:

  julia> using Unicode

  # for some reason HN doesn't allow emoji
  julia> graphemes(" ")
  length-1 GraphemeIterator{String} for " "

  help?> graphemes
  search: graphemes

  graphemes(s::AbstractString) -> GraphemeIterator


  Return an iterator over substrings of s that correspond to the extended graphemes in the string, as defined by Unicode UAX #29. (Roughly, these are what users would perceive as single characters, even though they may contain more than one codepoint; for example a letter combined
  with an accent mark is a single grapheme.)


I imagine the author would disagree with that because it does not have the “right” behavior by default.

For example, indexing is done by code unit, and length counts code points rather than graphemes. [1]

On the other hand, Raku's Str type does behave similarly to Swift's: indexing, length and iteration by grapheme; view methods for specific encodings. [2]

[1]: https://docs.julialang.org/en/v1/base/strings/ [2]: https://docs.raku.org/type/Str#routine_chars


Raku also gets it right.


Julia is not a major language like Swift.


> The minimum every software developer must know about Unicode

Just a nitpick...

Once more, as it is typical on HN, web programming is confused with the entire universe of software development.

There are plenty of software realms where ASCII not only is enough, but it actually MUST be enough.


This kind of assertiveness leads to garbage like C++ still not supporting UTF8 properly in 2023. My name contains diacritics. I am so, so, so tired of trying to work around information systems - not just web frontends - designed by people who don't care or worse, don't want to care.

"Web" programmers can care all they want about Unicode, but if the backend people didn't deal properly with text encoding, then something will break no matter what.

> There are plenty of software realms where ASCII not only is enough, but it actually MUST be enough.

Name one.


> if the backend people didn't deal properly

You are right. It's not a frontend/backend issue. It's a "for human" vs "not for human" issue. Personal names must be treated in an international-friendly manner.

>> There are plenty of software realms where ASCII not only is enough, but it actually MUST be enough. > > Name one

Joel himself described an example:

> It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it’s in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

The content of a webpage is required to be expressed in every supported language, but the HTTP protocol must not. And it would make no sense at all to add internationalization to intra-machines protocol, where ASCII is enough and has been enough for decades.

And if someone complains that ASCII only supports English, well... suck it up! I'm Italian and work in French, still I hate when a colleague sneaks in a comment not in English. Professional software development happens in English.


> The content of a webpage is required to be expressed in every supported language, but the HTTP protocol must not. And it would make no sense at all to add internationalization to intra-machines protocol, where ASCII is enough and has been enough for decades.

I guess no URLs with funny characters then. "GET /profile/renée" => 500 error, woohoo.

> And if someone complains that ASCII only supports English, well... suck it up! I'm Italian and work in French, still I hate when a colleague sneaks in a comment not in English. Professional software development happens in English.

Get over yourself, a lot of professional development happens in languages other than English.


> I guess no URLs with funny characters then. "GET /profile/renée" => 500 error, woohoo.

That's not really a slam dunk. Lots of sites don't let you have your name in the URL at all, and the average person's experience is that their name would be taken by someone else before they signed up.


Try looking at the moon rather than the finger next time.


Why? I'm not obligated to take sides here, and just because I agree with your overall point doesn't mean I can't point out problems with your argument.

In the metaphor, it's someone telling you your aim is off.


HTTP does support content-type tags and Unicode in URLs, which, funnily enough, comes in two different encodings: punycode and percent escapes.
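For anyone who hasn't seen the two encodings side by side, here is a small Python sketch using only the standard library. The path and host name are made-up examples; percent escapes apply to the path, punycode (via the built-in `idna` codec, which implements the older IDNA 2003 rules) applies to the host name.

```python
from urllib.parse import quote, unquote

# Path component: UTF-8 bytes are percent-escaped, "/" is left alone by default.
path = quote("/profile/renée")
print(path)  # /profile/ren%C3%A9e

# Host name: each non-ASCII label is converted to an ASCII "xn--" label.
host = "bücher.example".encode("idna")
print(host)  # b'xn--bcher-kva.example'

# Both transformations are reversible.
print(unquote(path), host.decode("idna"))
```

So what goes over the wire stays ASCII in both cases, while the user still sees non-ASCII text.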


I can name one. At my job we do the kind of embedded programming where encoders inside machines send data to each other. Like reading optical sensors and sending bits indicating state to other controllers.

We absolutely do not "need" to know about Unicode, outside of interest about other realms.


> This kind of assertiveness leads to garbage like C++ still not supporting UTF8 properly in 2023. My name contains diacritics.

UTF8 encoded diacritics work just fine in C++.


What do you mean by "work"? That you can store arbitrary bytes in a string? That's a pretty low bar.


> What do you mean by "work"? That you can store arbitrary bytes in a string? That's a pretty low bar.

That's all that's needed for a backend language.

The backend does not need to understand, or even acknowledge the existence, of grapheme clusters. Because the frontend is already having to understand all of this, it should be normalising any multi-codepoint ambiguous cluster anyway.


The backend never needs to do things like figure out how long a string is or search for one string in a database of other strings?


> The backend never needs to do things like figure out how long a string

Not as measured by clusters, no.

> search for one string in a database of other strings?

Hence I said "normalisation". The frontend already has to do all the unicode twiddling, it may as well normalise the input too.


It does if it ever wants to trim, summarize, sort or compare equal a string.


Well, proper Unicode support affects pretty much any area handling data about, used by, or created by, humans. That’s a pretty broad scope, and certainly wider than just web software.


What do you mean by “must be enough”?

Not being able to support non-Latin scripts sounds more like a limitation than a feature to me, although of course in many contexts it’s not in any individual organization's power to overcome it.


Quotes from the article illustrating what a train wreck Unicode has become:

"The problem is, in Unicode, some graphemes are encoded with multiple code points!"

"An Extended Grapheme Cluster is a sequence of one or more Unicode code points that must be treatead as a single, unbreakable character."

"Starting roughly in 2014, Unicode has been releasing a major revision of their standard every year."

"Å" === "Å" "Å" === "Å" "Å" === "Å" What do you get? False? You should get false, and it’s not a mistake.

"That’s why we need normalization."

"Unicode is locale-dependent"

The article forgot one: characters that switch presentation to right-to-left.


This is a lot more than the minimum that every software dev must know about Unicode. Even if you only do web frontends, you will do fine not knowing most of this. Still a nice read, though.


What an interesting mess!

It occurs to me that a canonical semantic representation of all known (extracted) language concepts would be useful too.

Now that we have multi-language LLM's it would be an interesting challenge to create/design a canonical representation for a minimum number of base concepts, their relations and orthogonal "voice" modifiers, extracted from the latent representations of an LLM across a whole training set, over all training languages.

While the best LLMs still have complex reasoning issues, their understanding of concepts and voice at the sentence level is highly intuitive and accurate. So the design process could be automated.

The result would be a human language agnostic, cross-culture concept inclusive, regularized & normalized (relatively speaking) semantic language. Call it SEMANTICODE.

We need to get this right, using one standard LLM lineage, before the Unicode people create a super standard that spans 150 different LLM's and 150 different latent spaces! :O

Stability between updates would be guaranteed by including SEMANTICODE as a non-human language in training of future LLMs. Perhaps including a (highly) pre-normalized semantic artificial language would dramatically speed up and reduce the parameter count needed for future multi-language training?

Then LLMs could use SEMANTICODE to talk to each other more reliably, efficiently, and with greater concept specificity than any of our single languages.


Tonsky, dude.

I stopped reading your article because of your little websocket experiment.


It sounds like a generic length function in Unicode in 2023 is no longer a good idea. These articles complaining about the variety of lengths in Unicode are annoying at this point. Pretty much all of them can be summed up as, "Well, it depends." And, that isn't wrong. But nerds love to argue until they are blue in the face about the One Correct Answer. Sheesh.

This is the most interesting comparison article I have seen in years about Unicode processing in C++: https://thephd.dev/the-c-c++-rust-string-text-encoding-api-l...

The author is also the lead on an open source C++ Unicode library called ztd.txt: https://github.com/soasis/text


The fuck is up with the cursor nonsense? I would've read this thing if it wasn't for that.


The author seems to hate people with concentration issues and/or various visual conditions.

It is already pretty bad that collaboration tools show the moving mouse cursors of other participants even when they aren't needed or wanted, so why bring that to a website?


This seems like good feedback but it could really be phrased more constructively. I doubt the author “hates” any such thing and you know it too. “Didn’t design with such in mind”, sure. You can do better.


yes I should have highlighted that it is satire

though it also wasn't meant to be constructive critique


If you have to recognize a grapheme cluster, it will be easier to do that from a sequence of code points, than from UTF-8.

It's like saying that we don't need to tokenize, because you never want to deal with tokens anyway, but phrase structures!

Mmkay, whatever ...


>3 Grapheme Cluster Boundaries

>It is important to recognize that what the user thinks of as a “character”—a basic unit of a writing system for a language—may not be just a single Unicode code point. Instead, that basic unit may be made up of multiple Unicode code points. To avoid ambiguity with the computer use of the term character, this is called a user-perceived character. For example, “G” + grave-accent is a user-perceived character: users think of it as a single character, yet is actually represented by two Unicode code points. These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.


> These user-perceived characters are approximated by what is called a grapheme cluster, which can be determined programmatically.

From everything i've read or heard about unicode, "determined programmatically" is false?


I quoted Unicode specification, so no, it's not false.


Well I can find 30 comments stating that it's false for all practical purposes only in this HN thread. From people who have more experience with Unicode than me.


Oh my god, is there ever anything simple about unicode


Compared to the ancient world of EBCDIC versus ASCII versus various ISO standards versus country-defined encodings versus Extended EBCDIC code pages versus Extended ASCII code pages which varied depending on operating system, nearest flag pole, network adapter, time of day, etc…: Unicode will forever be a simpler walk in the park.

Its complexity is a relief compared to where we've been. It's definitely not simple, but it will forever be far simpler than what our grandmothers had to work with if they were writing international software.


Give the "Indic scripts" section of https://manishearth.github.io/blog/2017/01/15/breaking-our-l... a read.

TL;DR: Unicode is complicated because some non-Latin writing systems are complicated and those non-Latin writing systems account for over a quarter of the world's population. (They're either majority or present in India, Indonesia, Pakistan, Bangladesh, the Philippines, etc.)


We need, desperately and without question, two Unicode symbols for bold and italic.

These are part of language and should not be an optional proprietary add on that can be skipped or deleted from text. We've been using the two "formats" to convey important information since the sixteenth century!!!

It boggles my mind that we can give flesh tone to emojis, yet not mark a word as bold or italic. It makes zero sense. Especially how easy it would be to implement. It would work exactly the same way: Letters following the mark would be formatted as bold or italic until a space character or equivalent.


"the definition of graphemes changes from version to version"

In what twisted reality did someone think this a good idea?

Doesn't it go against the whole premise of everyone in the world agreeing on how to represent a meaningful unit of text?

"What’s sad for us is that the rules defining grapheme clusters change every year as well. What is considered a sequence of two or three separate code points today might become a grapheme cluster tomorrow! There’s no way to know! Or prepare!"

"Even worse, different versions of your own app might be running on different Unicode standards and report different string lengths!"


I can sympathize why some programmers would prefer to stick their heads in the sand and stick to ASCII.


> The simplest possible encoding for Unicode is UTF-32. It simply stores code points as 32-bit integers.

Skipping over UTF-32-BE and UTF-32-LE there...

(I mean, it might not be an issue if it's just being used as an internal representation, but still)
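To make the byte-order point concrete, here is a quick Python sketch. The codec names are the standard Python ones; the sample character is arbitrary.

```python
ch = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE

big_endian = ch.encode("utf-32-be")     # most significant byte first
little_endian = ch.encode("utf-32-le")  # least significant byte first
with_bom = ch.encode("utf-32")          # byte order mark + platform byte order

print(big_endian)     # b'\x00\x00\x00\xc5'
print(little_endian)  # b'\xc5\x00\x00\x00'
print(len(with_bom))  # 8: four bytes of BOM plus four bytes for the code point
```

The same 32-bit integer has two serializations, so "UTF-32" on disk or on the wire still needs either a BOM or an agreed byte order.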


https://tonsky.me/blog/unicode/overview@2x.png

Wow, what abominable mix of decimal and hexadecimal.


Where are the decimal numbers in that image?


It goes 90000..9FFFF then 100000..10FFFF. The latter should have been A0000..AFFFF.

So the author is using hex for the last four digits and decimal for the remaining ones.


oops :) fixed, thanks!


What comes after 9FFFF?


Good catch

doh.


> what do you think "ẇ͓̞͒͟͡ǫ̠̠̉̏͠͡ͅr̬̺͚̍͛̔͒͢d̠͎̗̳͇͆̋̊͂͐".length should be?

This is a nice example of the kind of thing we need to think about when defining a measure of length for Unicode strings.


Four. Obviously.

The more interesting question is whether the Unicode rules actually give that answer.

EDIT: Just checked it using the first online tool [1] that came up and it indeed says four. So all is good.

[1] https://onlinetools.com/unicode/extract-unicode-graphemes


It should be 4 as long as you count the grapheme clusters which is what e.g. Swift does (hence String#count being O(n)).

In JavaScript, you can get the same information through Intl.Segmenter, which segments by grapheme cluster by default.
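For comparison, here is a Python sketch of the different "lengths" one string can have. Python's standard library has no UAX #29 segmenter, so the last figure is only a crude approximation that attaches combining marks to the preceding base character; it is an assumption of this example, not real grapheme cluster segmentation, and it gets ZWJ emoji sequences, Hangul jamo and flags wrong.

```python
import unicodedata

def length_report(s):
    return {
        "code_points": len(s),
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "utf8_bytes": len(s.encode("utf-8")),
        # Crude stand-in for grapheme clusters: count only non-combining code points.
        "approx_clusters": sum(1 for ch in s if unicodedata.combining(ch) == 0),
    }

# "e" + combining acute accent, followed by one emoji outside the BMP.
sample = "e\u0301\U0001F600"
report = length_report(sample)
print(report)
```

Four different numbers for what a user would call two characters, which is why "length" needs a unit attached.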


You could also have it in O(1), just store and maintain it as you usually store the length in bytes or code units. If you had all your string operations like substring work with grapheme clusters by default, which might arguably make sense quite often, then that could actually be a good decision. It might even make sense to maintain a list with pointers to each grapheme cluster or of all the grapheme cluster lengths together with the actual string data. Or maybe not, would probably depend heavily on the workload.


> Another unfortunate example of locale dependence is the Unicode handling of dotless i in the Turkish language.

This isn't quite Unicode's fault, as the alternative would be to have two codepoints each for `i` and `I`, one pair for the Latin versions and one for the Turkish versions, and that would be very annoying too.

Whereas the Russian/Bulgarian situation is different. There used to be language tags in Unicode for that, but IIRC they got deprecated, and maybe they'll have to get undeprecated.
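The dotless i problem is easy to see in a language whose case mapping is locale-independent. A small Python sketch (Python's `str.upper` and `str.lower` always apply the default Unicode mappings, never the Turkish tailoring):

```python
dotless_i = "\u0131"  # LATIN SMALL LETTER DOTLESS I
dotted_I = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Default mappings: fine for English, wrong for Turkish.
print("i".upper())        # I   (Turkish expects U+0130)
print("I".lower())        # i   (Turkish expects U+0131)

# Uppercasing the dotless i gives plain "I", so a round trip loses information.
round_trip = dotless_i.upper().lower()
print(round_trip == dotless_i)  # False

# Lowercasing the dotted capital yields two code points: "i" + U+0307.
print(len(dotted_I.lower()))    # 2
```

Getting Turkish right requires passing the locale to the case-mapping routine, which is what ICU and similar libraries do.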


The mouse cursors are really annoying; I stopped reading for that reason.


I'm always gonna point out these overly broad titles assuming "every software developer" is some kind of internetty web dev type. I'm a game dev, I try to never touch strings at all, they are a nightmare data type. Strings in a game are like graphics or audio assets: your game might read them and show them to the player, but they should never come anywhere near your code or even be manipulated by it. I don't need to know any of that stuff about Unicode.


The Why is "Å" !== "Å" !== "Å"? section still strikes me as wrong. The strings are equal even when the representations differ.


They are logically equal (that is, they represent the same text in an abstract way), but computing this equality in practice is expensive, because you first need to normalize the strings then compare.

Most languages, when comparing strings, skip the normalization and just compare string bytes as is (or, if the string is interned, compare just the pointer)
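A minimal Python sketch of "normalize, then compare", using the three spellings of Å from the article (the variable names are just for this example):

```python
import unicodedata

precomposed = "\u00c5"  # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212b"     # ANGSTROM SIGN
combining = "A\u030a"   # "A" followed by COMBINING RING ABOVE

# A plain == compares code points, so the three spellings are all different.
print(precomposed == angstrom, precomposed == combining)  # False False

# After NFC normalization they collapse to the same string.
normalized = {unicodedata.normalize("NFC", s) for s in (precomposed, angstrom, combining)}
print(len(normalized))  # 1
```

The cost mentioned above is the extra pass over both strings before the comparison, which is why most languages leave it to the caller.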


You can easily do the comparison dynamically with checking for combining marks, and then do the proper lookup. No need to normalize everything, or even store the normalized variant. Though in a filesystem or username lookup you would only store it normalized.


I'm just not sure why they put in the "Angstrom symbol" to begin with. If you do, then why isn't the "meter symbol" (m) also represented?

Fortunately, it seems like it's marked as deprecated: https://en.wikipedia.org/wiki/Angstrom#Symbol


> I'm just not sure why they put in the "Angstrom symbol" to begin with.

Frequently, the answer to this is "some obscure character set had this as a distinct symbol." In this case, blame the Japanese: https://en.wikipedia.org/wiki/JIS_X_0208

Which is why there's an 'mm' and 'cm' and other random symbols: https://www.compart.com/en/unicode/block/U+3300


Elixir also gets the length correctly, not only Swift.


Can we please get a standard that describes how emoji are supposed to look?

Now they look different on every platform and many subtleties are lost in translation.


Yeah, this problem has led me to avoid using emojis. I can't be sure that the meaning I was intending is the one being depicted by the recipients machine.

It's probably a good thing, though.


The only modern language that gets it right is Swift:

Apple did a fairly good job with unicode string handling starting in Cocoa and Objective-C, by providing methods to get the number of code points and/or bytes:

https://stackoverflow.com/questions/15582267/cfstring-count-...

I feel that this support of both character count and buffer size in bytes is probably the way to go. But Python 3 went wrong by trying to abstract it away with encodings that have unintuitive pitfalls that broke compatibility with Python 2:

https://blog.feabhas.com/2019/02/python-3-unicode-and-byte-s...

There's also the normalization issue. Apple goofed (IMHO) when they used NFD in HFS+ filenames while everyone else went with NFC, but fixed that in APFS:

https://unicode.org/faq/normalization.html

https://medium.com/@sthadewald/the-utf-8-hell-of-mac-osx-fee...


I enjoyed how the timeline graphic included Joel's article. Because my first thought was hey, isn't this the same title.


And yet, many modern, recent apps can't even encode the accented European character in my given name. Sigh.


> They will look the same (Å vs Å)

No. In my browser the first A has the ring glued to it, and the second has a little gap.


> Unicode is locale-dependent

Well, there is a new fact that I learned and immediately hated.

The fuck were authors thinking...

I am now firmly convinced people developing unicode hate developers. I suspected it before just due to how messy it was (same character having different encodings ? Really ? Fuck you), but this cements it.


Unicode is not locale-dependent, just mapping from graphemes to (font) glyphs is locale/font dependent.


The author shows how to-upper and to-lower change according to locale

But making it clear which glyph to use is also a key feature!


Yeah this is a big problem for me right now trying to pick fonts and characters for CJK. I have a bunch of bugs to fix that will require sending the locale down to the text itemization code.


Well C is locale dependent. And one does not break backwards compatibility with C for fear of badness. So naturally Unicode must be locale dependent too.


> people developing unicode hate developers

Or at least they have a vicious indifference to us. Unicode is a nightmare.


> Since everybody in the world agrees on which numbers correspond to which characters, and we all agree to use Unicode, we can read each other’s texts.

Hmm? I thought some code points combine to create a character. Even accented latin ones can be like that.

Also we need to agree on what is a character.


> Also we need to agree on what is a character.

Indeed. I used to think I knew what a character was until Unicode came around. Now I genuinely don't know with any real certainty.


I am torn between supporting all languages (which easily leaks into supporting emojis) versus just using the 90~ Latin characters as the lingua franca.

Look, I would love to be able to read/write Sanskrit, Arabic, Chinese, Japanese etc and share those content and have everyone render and see the same thing. The problem is that I feel like most of these are:

1. a kind of open problem
2. very subjective
3. very, very subjective, as what you see is mostly dictated by the implementation (fonts)

For example, why does a gun emoji look like a water gun? Why does the skull-and-crossbones symbol look so benign? In fact, it is often used as a meme (see deadass :skull:). Why is the basmala a single "character"?

In my opinion, people should just learn how to use kaomoji. Granted, kaomoji rely on a lot more than the Latin characters, but they are at least artful, skillful and a natural extension of the "actual" languages.

> inb4 languages evolve

Yes, but it mostly happens naturally. I feel like what happens today mostly happens at the whim of a few passionate people in the standard.


> I am torn between supporting all languages (which easily leaks into supporting emojis) versus just using the 90~ Latin characters as the lingua franca.

I don't want to support emoji either (and, I don't want emoji on my computer), although in some cases, if it is really necessary to be supported, they could be implemented just as text characters instead of as colourful emoji, anyways.

For many purposes (e.g. computer codes) ASCII is good enough (and actually even can be better since it avoids the security problems of using Unicode). (Sometimes, character sets other than ASCII can be used, e.g. APL character set for APL programming.)

> Look, I would love to be able to read/write Sanskrit, Arabic, Chinese, Japanese etc

I also would, but Unicode is bad enough that I would use other ways of doing such a thing when possible (even writing my own programs, etc). (If a program insists on Unicode, I might just use ASCII only anyways, or write my own program)

Not everyone necessarily needs to see the same thing (if it is text, rather than pictures of the text). A character set suitable for that language can be used (with fixed pitch if necessary, etc.), and a suitable font on your computer can be selected automatically according to the reader's preference.

So, I prefer to support all languages (where applicable; sometimes it isn't), without using Unicode.


> “I know, I’ll use a library to do strlen()!” — nobody, ever.

The standard library provided by languages like C, C++ is a library. Features like character strings are present and it's a totally reasonable expectation for the length to give you the cluster count.


No, for C and C++, which are close to the hardware, it's totally reasonable to expect strlen() to give you the byte count. You don't allocate memory for buffers based on the cluster count.

If you want cluster count, call a different function.


Given that strlen() predates Unicode by...20 years(?) - it's not terribly surprising that isn't a viable approach.


>That gives us a space of about 11 million code points. About 170,000, or 15%, are currently defined. An additional 11% are reserved for private use. The rest, about 800,000 code points, are not allocated at the moment. They could become characters in the future.

1.1 million?
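The arithmetic is quick to check. A Python sketch, taking the article's "about 170,000 defined" figure as given:

```python
PLANES = 17                      # plane 0 (BMP) plus 16 supplementary planes
CODE_POINTS_PER_PLANE = 0x10000  # 65,536 code points per plane

total = PLANES * CODE_POINTS_PER_PLANE
print(total)                     # 1114112, i.e. U+0000 through U+10FFFF

defined_share = 170_000 / total  # the article's count of defined code points
print(round(defined_share * 100))  # 15
```

So the 15% in the quote only works out against roughly 1.1 million, not 11 million.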


Yeah, the author's first number is off by a factor of ten: the code space is about 1.1 million code points. The "170,000" and "800,000" figures are consistent with that.


One of my favorite interview questions is simply "What's the difference between Unicode and utf-8?" I feel like that's pretty mandatory knowledge for any specialty but it doesn't get answered correctly very often.


In Java/Kotlin, I've found this Grapheme Splitter library to be useful: https://github.com/hiking93/grapheme-splitter-lite


> The only modern language that gets it right is Swift

The only modern language ((he knows of)) that gets it right to be precise. Ruby also gets it right.

```
[1] pry(main)> "".size
=> 1
```


I once bought an O'Reilly book on encoding. It was like 2000 pages. I never read it, that was about 15 years ago. My take away is that encoding is really complex and I just kind of pray it works which most of the time it does.


The number of grapheme clusters in a string depends on the font being used. The length of a string should be the number of code points because that is not font specific.

Better yet, there shouldn't be a function called length.


> The only modern language that gets it right is Swift:

Elixir too:

  Interactive Elixir (1.15.4) - press Ctrl+C to exit (type h() ENTER for help)
  iex(1)> String.length "ẇ͓̞͒͟͡ǫ̠̠̉̏͠͡ͅr̬̺͚̍͛̔͒͢d̠͎̗̳͇͆̋̊͂͐"
  4


Is there a way to read this with the mouse cursors disabled? It seems like great content but all the movement on the page is way too distracting.

EDIT: I've never been downvoted for asking a question before. Weird, but okay.


The article's background color deserves to be named: https://colornames.org/color/fddb29


What's the point of having a separate codepoint for the Angstrom if it's specified to normalize back to the regular "capital A with ring above" codepoint anyway?


> The rest, about 800,000 code points, are not allocated at the moment. They could become characters in the future.

Why is Tengwar still not in Unicode officially? What's the problem with it?


To save other people the google: Tengwar is probably not in unicode because it is a fictional script from a book.


While U+A66E multiocular O can be found in just one manuscript, and it is still in Unicode: https://en.wikipedia.org/wiki/Multiocular_O


Honestly, I wouldn't have thought that would be an issue to the Unicode folks. They have already allowed things (emoji) that have no place being in the standard, as they aren't even text.


I feel like Apple pushed the consortium to add a ton of useless emojis for whatever their own reasons were.


According to a sibling to what you replied to, it's because the shapes of the glyphs are still under copyright by known-litigious rightsholders and the Unicode consortium doesn't want to subject font authors to that.


I would wonder how many people are here who have never seen Tengwar. I would bet that's a minuscule minority.


I've never even heard of it before.


That's a higher bar than having seen it, I think. I also had to look it up, but as soon as I saw the images in Wikipedia I knew that it's from Lord of the Rings.


It is. But the even higher bar is that you actually write in this script.


Looks like Georgian


The problem with Tengwar (and Klingon) is the problem with a lot of pop culture right now: copyright. The Tolkien Estate still exists and still litigiously upholds what it can of their copyright terms. CBS Viacom (Paramount) still claim a copyright interest in all the written forms of Klingon.

Copyright is not technically violated simply by encoding the characters into a plane such as one of Unicode's; that's an easy open-and-shut fair use. But Unicode principals have stated they don't want to pass on the copyright burden to font authors either, who would be sued if they tried to paint some of those characters. (Why encode something that fonts aren't allowed to produce?) That should also be fair use, but the law is complicated and copyright still so often today leans in favor of the Estates and major Corporations rather than fair use and the public commons.

(ETA: I'm hugely in favor that "conlang", constructed language, scripts such as these should be encoded by Unicode. I wish someday we fix the copyright problems of them.)


Tengwar is in the Under-ConScript Unicode Registry: <https://www.kreativekorp.com/ucsur/>


The ConScript Unicode Registry is a volunteer project to coordinate the assignment of code points in the Unicode Private Use Areas (PUA). Why does tengwar have to be in the PUA, why not make it a first-class charset? It's not just a minor conlang a small group of geeks invented on a weekend, it's a well-established piece of the modern culture, isn't it?


According to a sibling to what you replied to, it's because the shapes of the glyphs are still under copyright by known-litigious rightsholders and the Unicode consortium doesn't want to subject font authors to that.



> Hell, you can’t even spell café, piñata, or naïve without Unicode.

I must have missed something. All of those symbols are present in Extended ASCII (i.e. 8-bit).


Extended ascii is a bodge, and requires you to set a code page to pick the right set of "extended" characters. Unicode is also a superset of ascii though, so that sentence is right, on a technicality.
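The code page problem in one Python sketch: the same four bytes decode to different text depending on which "extended ASCII" you assume. The codec names are Python's standard ones; the three code pages are just examples.

```python
raw = b"caf\xe9"  # "café" as written by a Latin-1 system

as_latin1 = raw.decode("latin-1")  # Western European
as_cp437 = raw.decode("cp437")     # original IBM PC code page
as_cp1251 = raw.decode("cp1251")   # Windows Cyrillic

print(as_latin1, as_cp437, as_cp1251)  # café cafΘ cafй

# UTF-8 needs two bytes for "é", but there is only one way to read them.
print("café".encode("utf-8"))  # b'caf\xc3\xa9'
```

Only the first three bytes, the ASCII part, mean the same thing everywhere.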


Really great article. Hitting all the points I would expect.


Honestly the "what encoding is this! UTF-8" is still the only thing we need to know. len(emoji) is still a corner case that few will care about.


That's what everyone thinks, until the user sticks an emoji in the name field


No emojis (or anything else remotely exotic) in names thanks.

    /^[A-Za-z\-' ]+$/
Users can beg and grovel at my feet for every measly character beyond that puny set that they want allowed in a name field.


My anglophone Canadian brother's name is André. Even if you're fine with alienating the ~50% of the world population using non-latin writing systems, probably best to at least stick to the stuff covered by the latin1 legacy encoding.
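
A quick Python sketch of the problem: the STRICT pattern is the regex quoted above, while is_plausible_name is a hypothetical, looser check of my own and not a recommendation from anyone in this thread. It leans on str.isalpha(), which is Unicode-aware but still rejects combining marks that NFC cannot fold away, so it will wrongly refuse names in scripts that depend on them.

```python
import re
import unicodedata

# The ASCII-only pattern from the parent comment.
STRICT = re.compile(r"^[A-Za-z\-' ]+$")

def is_plausible_name(name: str) -> bool:
    """Hypothetical looser check: Unicode letters plus hyphen, apostrophe, space."""
    # Fold "e" + combining acute into the precomposed letter where possible.
    name = unicodedata.normalize("NFC", name)
    return bool(name) and all(ch.isalpha() or ch in "-' " for ch in name)

print(STRICT.match("André"))          # None: the strict pattern rejects it
print(is_plausible_name("André"))     # True
print(is_plausible_name("李小龍"))     # True
print(is_plausible_name("\U0001F600"))  # False: emoji are not letters
```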


Errata: In the table under "How many bytes are in UTF-8?", bottom row, "10000" should be "100000".


Anyone else hate titles like this? There are millions of developers working on a large variety of things. It sounds so arrogant to me.


It references previous famous blogposts. Much like "$THING considered harmful" titles. You'd also have a hard time working with computers in the modern day without running into unicode.


    $ ping6 tonsky.me
    ping6: no address associated with name
    $

What every software developer must know about IPv6 in 2023 (still no excuses!).


Should "Extended Grapheme Cluster" be understood as "Extended (Grapheme Cluster)" or as "(Extended Grapheme) Cluster"?


"Extended (Grapheme Cluster)".

The .graphemes() method in Rust's unicode-segmentation crate takes an is_extended boolean as an argument and, if you set it to false, you're iterating legacy grapheme clusters.


With the benefit of hindsight, would we include the error detection bits of UTF-8 if we could choose not to?


Yes. Give https://www.youtube.com/watch?v=_mZBa3sqTrI a watch... especially the "Oh my God! We've been hacked!" part at 36:20.

TL;DR: They had a transient glitch in their network switch and, because Windows uses UTF-16 when sending remote event logs over the wire, whenever it dropped a single byte, it had the effect of swapping the endianness of the messages, resulting in scary Chinese text in the logs.

You could get the same effect by naively applying byte-wise processing to UTF-16 or UTF-32, or having an off-by-one error.

UTF-8 is self-synchronizing so one-byte errors like that only lose you one character, rather than corrupting the entire stream going forward.
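
A minimal Python sketch of that difference, under my own assumptions: the sample string and the drop_byte helper are made up for illustration, and the "glitch" is simulated by deleting the second byte of each encoded stream.

```python
def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate a transmission glitch that loses exactly one byte."""
    return data[:index] + data[index + 1:]

text = "héllo wörld"

# UTF-8: the orphaned continuation byte is flagged, then decoding resyncs
# at the next lead byte, so only one character is lost.
damaged_utf8 = drop_byte(text.encode("utf-8"), 1).decode("utf-8", errors="replace")

# UTF-16: every later 16-bit unit is now read across the wrong byte pair,
# so the rest of the stream turns into unrelated characters.
damaged_utf16 = drop_byte(text.encode("utf-16-le"), 1).decode("utf-16-le", errors="replace")

print(damaged_utf8)   # h\ufffdllo wörld (one replacement character)
print(damaged_utf16)  # mostly CJK and private-use code points
```

The UTF-16 result is the same kind of accidental "Chinese text" described in the talk: each ASCII letter's byte ends up in the high half of a 16-bit unit.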


Unicode looks like a big over-engineered standard that had 50 hands trying to put their mark on it


It looks like that because Unicode is trying to solve a problem that everyone thinks is easy until they uncover the true extent of encoding human languages.


How does this explain surrogate pairs?


Surrogate pairs were new to Unicode 2.0. Unicode 1.0 didn't anticipate the need for more than 65,536 code points (who would ever need more?); the main perceived threat to that limit having been resolved by Han unification.
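
For reference, the surrogate pair arithmetic is short enough to sketch in Python. The function name to_surrogates is my own; the constants are the standard UTF-16 ones.

```python
def to_surrogates(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000       # 20 bits remain
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)  # GRINNING FACE
print(f"{high:04X} {low:04X}")      # D83D DE00
```

Two 10-bit halves give 2^20 extra code points on top of the original 65,536, which is where the U+10FFFF ceiling comes from.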


Ok, but that doesn't answer the question; it's more of an indication that those designers didn't uncover "the true extent" until years later


That's because Unicode chose to be a superset of all other encodings, so they've brought everyone else's complexity and tech debt.


Technically, a superset would have to somehow Schrödinger's cat around \ in latin1 and ¥ in Shift-JIS being the same codepoint.

Unicode just took it upon themselves to reliably round-trip legacy text... thus the precomposed forms.
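
A small Python sketch of what that buys you, using é as my own example: the precomposed code point lets Latin-1 text round-trip one byte to one code point, at the cost of there being two spellings of the same letter that normalization has to reconcile.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Legacy round trip: one Latin-1 byte maps to one code point and back.
round_trip = precomposed.encode("latin-1").decode("latin-1")

# The two spellings compare unequal until they are normalized.
same_before = precomposed == decomposed                       # False
nfd = unicodedata.normalize("NFD", precomposed)               # "e\u0301"
nfc = unicodedata.normalize("NFC", decomposed)                # "\u00e9"
```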

Most of the other complexity and technical debt is in the writing systems themselves.


The article doesn't mention how to resolve string manipulation problems involving locales.


"roll your own"

Rather not. It takes an incredible amount of work to get it right. Just stick to ICU.


Anyone know what the story is behind the "Weird Emoji" around 140000 on the map?


The E0000-E007F block is the "Tags" block, which is used for flag emojis.

But there is not a code for each flag. Instead there is a code for each ASCII character. A flag sequence is formed from U+1F3F4 (Waving Black Flag), followed by at least two tags that form a country/region code, and then U+E007F (Cancel Tag).

So, yes this is weird, because the emoji is dependent on the decoder. It was made this way to keep Unicode independent of geopolitics.

Read more: <https://en.wikipedia.org/wiki/Tags_(Unicode_block)>
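
The construction described above can be sketched in a few lines of Python. The helper name tag_flag is hypothetical, and "gbeng" (England) is my own choice of region code. Note that, as far as I know, tag sequences are used in practice for subdivision flags like this one, while ordinary country flags are encoded as pairs of regional indicator symbols instead.

```python
BLACK_FLAG = "\U0001F3F4"   # U+1F3F4, the base character
CANCEL_TAG = "\U000E007F"   # U+E007F, terminates the tag sequence

def tag_flag(region: str) -> str:
    """Build a tag-sequence flag: base flag + one tag per ASCII letter + terminator."""
    # Each tag character sits at 0xE0000 plus the ASCII value it mirrors.
    tags = "".join(chr(0xE0000 + ord(ch)) for ch in region.lower())
    return BLACK_FLAG + tags + CANCEL_TAG

england = tag_flag("gbeng")
print(len(england))                       # 7 code points
print([f"U+{ord(ch):X}" for ch in england])
```

Whether those seven code points show up as a flag or as a plain black flag depends entirely on the font and renderer.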



Reload with Javascript disabled to remove the distracting fake mouse pointers.


I knew that domain, so I had sunglasses at hand before opening the page!


Can there be overlaps between fonts in the private use area?


Yes. "Private" in this case means that you can't expect consistent behavior from one system to the next.


Firefox 120.0a1 has a difficult time displaying this page


Best. Explanation. Ever.


If only for the fact that it annoys people, I love the mouse cursor idea. But I also find it technically interesting. Is some kind of consent legally needed per GDPR or something for this? It for sure is tracking, literally. And a website has to ask to set cookies ...


Cookie consent is only necessary if you're sharing it with others (eg. ad networks, Google Analytics, etc.) or using it for "non-essential" functions (again, stuff like analytics). Sites just don't want the general public to realize that.

As for the mouse cursors, I don't think they qualify as personal information under the GDPR, but IANAL.


It's needed when you collect the data. Cursor position is being used directly to make the service "function" though, even if the function it enables is pretty novel. This is entirely debatable, and probably matters less than showing where your cursor is in a Google Doc.

Regardless, I don't think it matters since the author is not in the EU.


I'm sorry, but the website design is extremely distracting. The mouse pointers at least are easy to delete with the inspector; the background color is not the best choice for reading material, but the inexcusable part is the width of the content.

This content must be really awesome for someone to go through the trouble of interacting with such a site.


Guys, you don't need to know that crap.


I don't want to be too full of myself here, but I'm a very skilled and highly paid backend software engineer who knows roughly nothing about unicode (I google what I need when a file seems f'd up), and it's never been a problem for me.

I'm sure the article is good but the title is nonsense.


The title is definitely nonsense. The reality is that for most people, they will never need to know the gritty details of how to encode or decode UTF-8. The article is interesting, but I was pretty put off with how the author led with such a hyperbolic (and untrue) claim.



