
> But I Want the Length to Be 1!
> There’s a language for that

Perl 6: https://docs.perl6.org/routine/chars


And of course, if you want the other length forms, use .codes or .encode('UTF-8').bytes. But internally to Rakudo, an emoji is really just one code point, so most of the common string ops are O(1). There's a bit of an optimization if all of the code points fit into ASCII, but otherwise we use synthetic code points to represent all of the composed characters.
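For comparison, here is a rough Python 3 sketch of the "other length forms" for the same kind of string. The helper name `length_forms` is made up for illustration. The standard library has no built-in grapheme-cluster count, so only the code point, UTF-8 and UTF-16 lengths are shown.

```python
# Facepalm emoji with skin tone and gender modifiers, written as escapes
# so the example does not depend on how this page is encoded:
# U+1F926 U+1F3FC U+200D U+2642 U+FE0F
s = "\U0001F926\U0001F3FC\u200D\u2642\uFE0F"


def length_forms(text: str) -> dict:
    """Return the length of `text` measured in three different units."""
    return {
        # len() on a Python 3 str counts Unicode code points
        "code_points": len(text),
        # number of bytes once encoded as UTF-8
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 code units: 2 bytes each; "utf-16-le" adds no BOM
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


print(length_forms(s))
# {'code_points': 5, 'utf8_bytes': 17, 'utf16_code_units': 7}
```

None of these is 1, which is the number you get when counting extended grapheme clusters.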

This is probably the biggest mystery to me of the Python 3 migration. If they were going to break backcompat, why on Earth didn't they fix Unicode handling all the way? They didn't have to go completely crazy with new syntax like Perl 6 did, but most languages shift too much of the burden of handling Unicode correctly onto the programmer.

With Unicode being a moving target I'm not sure any language will truly "fix it all the way": building in things like grapheme-cluster breaking/counting to the language just means the language drifts in and out of "correctness" as the rules or just definitions of new or existing characters change. Of course, this is covered in the article, but when you "clean up" everything such that the language hides the complexity away you can still have people bitten (say, by not realizing a system/library/language update might suddenly change the "length" of a stored string somewhere). Or you could simply have issues because developers aren't totally familiar with what the language considers a "character," as there's essentially no agreement whatsoever across languages on that front (Perl 6 itself listing the grapheme-cluster-based counting as a potential "trap" and noting that the behavior differs if running on the JVM.) I don't think a "get out of jail free card" for Unicode handling is really possible.
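A small Python 3 sketch of the "what counts as a character" problem, using only the standard `unicodedata` module. It does not do grapheme-cluster breaking. It only shows that two strings a user would call the same single character have different code point lengths, and that the Unicode data version is tied to the interpreter you happen to run.

```python
import unicodedata

# "é" as one precomposed code point (U+00E9)
composed = "\u00e9"
# The same user-perceived character as "e" + combining acute accent (U+0301)
decomposed = unicodedata.normalize("NFD", composed)

print(len(composed))    # 1
print(len(decomposed))  # 2

# The two strings compare unequal until they are normalized to the same form
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True

# The Unicode database version ships with the interpreter, so it can
# change when the runtime is upgraded
print(unicodedata.unidata_version)
```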

The codepoint-based string representation used by Python 3 may be "the worst" (I'm not totally sure I agree) but it's fine. The article's main beef is about the somewhat complex nature of the internal storage and the obfuscation of the underlying lengths.

I mentioned Perl 6's in-RAM storage format.

I didn't seek to mention every programming language for everything. E.g. I didn't mention C#, since UTF-16 was already illustrated using JavaScript.

So does PHP, with `mb_strlen`.
