
That's silly. How often have you had to work with grapheme clusters without also using a text rendering engine? But the number of times you need to know the number of bytes a string takes, even when using scripting languages, is much higher. The only way to deal with this is to not make assumptions, and not have a string.size() function, but specific accessors for 'size in bytes', 'number of code points' and (potentially, if the overhead is warranted) 'number of grapheme clusters'.
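To make the first two accessors concrete, here is a minimal Python sketch. The function names `size_in_bytes` and `code_point_count` are made up for illustration and are not part of any standard API. A grapheme cluster count is left out because Python's standard library does not ship the Unicode segmentation rules it would need.

```python
def size_in_bytes(text: str, encoding: str = "utf-8") -> int:
    """Storage size of the text; the answer depends on the encoding chosen."""
    return len(text.encode(encoding))


def code_point_count(text: str) -> int:
    """Number of Unicode code points; this is what len() reports for a Python str."""
    return len(text)


# "naïve" with a precomposed ï (U+00EF), a space, and one emoji outside the BMP.
sample = "na\u00efve \U0001F600"

print(size_in_bytes(sample))               # 11 bytes in UTF-8
print(size_in_bytes(sample, "utf-16-le"))  # 16 bytes in UTF-16
print(code_point_count(sample))            # 7 code points
```

The same string gives three different numbers, which is the argument against a single ambiguous size().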

The 'fundamental' problem here is that the average programmer doesn't understand 'strings' because they seem so easy but are actually very hard (well, not even hard, just big and tedious). Even more so now that many people can have careers without really knowing what a 'byte' is or how it relates to their code.

> How often have you had to work with grapheme clusters without also using a text rendering engine?

All the time. Want to truncate user text? You need grapheme clusters. Reverse text? Grapheme clusters. Get what the user thinks of as the length of text? Grapheme clusters. Not saying it’s a good idea to make them any sort of default because of that, though; you’re right that it should be explicit.
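A rough Python sketch of why truncating or reversing by code point goes wrong. Note that `rough_clusters` is a hypothetical helper and a deliberate simplification: it only glues combining marks, emoji skin tone modifiers, and zero width joiner sequences onto the preceding character. It is not the full Unicode segmentation algorithm (UAX #29), which real code should get from a dedicated library.

```python
import unicodedata

ZWJ = "\u200d"  # zero width joiner


def _extends_previous(ch: str) -> bool:
    """True for characters this sketch attaches to the preceding cluster."""
    return (
        unicodedata.category(ch) in ("Mn", "Mc", "Me")  # combining marks
        or ch == ZWJ
        or 0x1F3FB <= ord(ch) <= 0x1F3FF                # emoji skin tone modifiers
    )


def rough_clusters(text: str) -> list[str]:
    """Approximate grapheme clusters (simplified, NOT full UAX #29)."""
    clusters: list[str] = []
    for ch in text:
        if clusters and (_extends_previous(ch) or clusters[-1].endswith(ZWJ)):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def truncate(text: str, max_clusters: int) -> str:
    """Keep at most max_clusters user-perceived characters."""
    return "".join(rough_clusters(text)[:max_clusters])


def reverse(text: str) -> str:
    """Reverse the order of clusters without reordering code points inside them."""
    return "".join(reversed(rough_clusters(text)))


teacher = "\U0001F469\U0001F3FF\u200D\U0001F3EB"  # woman, dark skin tone, ZWJ, school
text = "cafe\u0301 " + teacher                     # "café" written with a combining accent

print(len(text), len(rough_clusters(text)))  # 10 code points, 6 clusters
print(text[:4] == "cafe")                    # True: code point slicing drops the accent
print(truncate(text, 4) == "cafe\u0301")     # True: the accent stays with its letter
```

Reversing with `text[::-1]` would move the accent onto a different character and pull the emoji sequence apart, while `reverse(text)` keeps each cluster intact.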

Truncating text is almost always (in my experience) a UI thing, where you pass a flag to some UI widget saying 'truncate this and that on overflow' and while rendering, it can then truncate using grapheme clusters.

How often does one reverse text? And when do users care about text length? Almost always (again, in my experience) in the context of rendering - when deciding on line length or break points, so when you know and care about much more than just 'the string' - but also font, size, things you only care about in the context of displaying. Not something that should be part of the 'core' interface of a programming language.

I mean I think we agree here; my point was that I too used to think that grapheme clusters mattered, but when I started counting in my code, it turned out they didn't. Sure, I can think of cases where it theoretically would matter, but I'm talking about what you actually use, not what you think you will use.

I’m biased towards websites, but truncating text server-side to provide an excerpt is something I need to do pretty often. Providing a count of remaining characters is maybe less common, but Twitter, Mastodon, etc. need to do it, and people expect emoji (for example) to count as one.

Plus sometimes you’re the one building the text widget with the truncation option.

Twitter's count of "characters" is code points after normalization[1].

I don't know who expects emoji to count as one character, but they'd be surprised by Twitter's behavior: something like 👩🏿‍🏫[2] counts as 4 characters (woman, dark skin tone, zero width joiner, school).

[1] https://developer.twitter.com/en/docs/basics/counting-charac...
[2] https://emojipedia.org/female-teacher-type-6/
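The count of 4 is easy to reproduce in Python. This sketch assumes the normalization form is NFC and only illustrates counting code points after normalization; it is not Twitter's actual implementation, and `normalized_code_points` is a name invented here.

```python
import unicodedata


def normalized_code_points(text: str) -> int:
    """Count code points after NFC normalization (NFC is an assumption here)."""
    return len(unicodedata.normalize("NFC", text))


# The emoji from [2], spelled out: woman, dark skin tone, zero width joiner, school.
teacher = "\U0001F469\U0001F3FF\u200D\U0001F3EB"

print(normalized_code_points(teacher))    # 4: normalization does not merge the sequence
print(normalized_code_points("e\u0301"))  # 1: e + combining acute composes to a single é
```

So normalization helps with accented letters but does nothing for emoji sequences, which still count as several "characters".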

> And when do users care about text length?

When validating many types of data submitted via the web.
