I don't get your point - you should not care about characters/codepoints, but ra...

pilif · on Feb 9, 2012

Most modern languages such as Python 3 and Ruby 1.9 have you only care about characters and never about the internal representation.

Calling the respective methods to get a strings length for example will always return the length in characters. There is no way to get to the byte length without explicitly naming an encoding you'd like to get the byte length for.

Older languages, like PHP or JS, Python 2 and Ruby 1.8 leak their internal implementations. The methods to retrieve a string length would return the amount of bytes the internal representation of the string requires. If you need the length in characters, you need to call different methods - sometimes even from external libraries.

cygx · on Feb 9, 2012

> Calling the respective methods to get a strings length for example will always return the length in characters.

Most languages return the length in _unicode characters_, which is just the number of codepoints.

However, in most cases, the programmer actually wants the number of user-perceived characters, ie _unicode grapheme clusters_.

UTF-32 has to be treated as a variable-length coding in most cases, no different from UTF-16 - otherwise, you'd miscount even characters common in western languages like 'ä' if it happens that the user used the decomposed form.

Even normalization doesn't help with that, as not all grapheme clusters can be composed into a single codepoint.

Perl6 is an example of a language which does the right thing here: Its string type has no length method - you have to be explicit if you want to get the number of bytes, codepoints or grapheme clusters.

To add some confusion back in, the language also provides a method which gets the number of 'characters', where the idea of what a character is can be configured at lexical scope (it defaults to grapheme cluster).

masklinn · on Feb 9, 2012

> To add some confusion back in, it also provide a method which gets the number of 'characters', where the idea of what a character is can be configured at lexical scope (it defaults to grapheme cluster).

That's actually pretty cool, as it lets the library configure itself for the representation which makes most sense to it: a library which deals in storage or network stuff can configure for codepoints or bytes length, whereas a UI library will use grapheme clusters for bounding box computations & al.

Configuring it lexically also makes sense as it avoid leaking that out (which dynamically scoped configuration would).

cygx · on Feb 9, 2012

I agree that this flexibility is nice to have, and the comment was a bit tongue-in-cheek, because reading the spec felt like this:

length: This word is _banned_. Evil. DO NOT USE!

chars: Same thing as length, just with a new name oO

alinajaf · on Feb 9, 2012

Interesting, did not know about grapheme clusters. More info on unicode.org:

http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

masklinn · on Feb 9, 2012

> Most modern languages such as Python 3

Not true until Python 3's "flexible strings" implementation. Unless you're using "wide" builds (which use UTF-32 internally), which are already available in Python 2 and are not the default representation.

thristian · on Feb 9, 2012

I'm pretty sure every Linux distribution's official Python packages are wide builds. Certainly, this is the case on Ubuntu and Debian, and I think Red Hat as well.

masklinn · on Feb 9, 2012

> I'm pretty sure every Linux distribution's official Python packages are wide builds.

That's a matter of Linux distributions packaging (again, by default, without any specific configuration, Python will set itself up using narrow builds), and if you assume wide builds your code is broken.

Furthermore, pilif asserted a difference between Python 2 and Python 3. There is no such thing prior to the yet-unreleased Python 3.3 as making wide builds the default was explicitly rejected for Python 3, Python 2 and Python 3 behave exactly the same way on that front (again, prior to Python 3.3)

Of course pilif is also wrong in asserting that "The methods to retrieve a string length would return the amount of bytes the internal representation of the string requires.", Python < 3.3 returns the number of code units making up the string (never the number of bytes for the the unicode string types — str/bytes is a different matter as it's not unicode data)

Nick_C · on Feb 11, 2012

FYI, Slackware isn't, it is narrow build.