Note that this is only really true of Python 3.3 and later; in earlier versions, things would start breaking for characters outside the BMP (which is where JS is still stuck, by the way) unless you had a wide build, which used a lot of memory for strings (4 bytes per character).
In general, using Unicode internally and converting to and from bytes when doing I/O is the right way to go.
But: due to http://en.wikipedia.org/wiki/Han_unification, being locked into Unicode by a language might not be feasible for everybody. Especially in Asian regions, Unicode isn't yet as widespread and you still need to deal with regional encodings, mainly because even with Unicode's huge character set, we still can't reliably write in every language.
Ruby 1.9 and later helps here by having many, many string types (one for each encoding it knows), which can't be assigned to each other without conversion.
This allows you to still have an internal character set for your application and do encoding/decoding at I/O time, but you're not stuck with Unicode if that's not feasible for your use case.
People hate this though because it seems to interfere with their otherwise perfectly fine workflow ("why can't I assign this "string" I got from a user to this string variable here??"), but it's actually preventing data corruption (once strings in multiple encodings are mixed up, it's often impossible to un-mix them if they have the same character width).
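A quick Python 3 sketch of the kind of corruption I mean (the strings are made up for illustration): once byte strings in two different encodings get concatenated, no single decode recovers the original text.

```python
# Two byte strings, each fine on its own, but in different encodings.
a = "héllo".encode("utf-8")    # b'h\xc3\xa9llo'
b = "wörld".encode("latin-1")  # b'w\xf6rld'

mixed = a + b

# Decoding the mix with either encoding mangles the other half:
print(mixed.decode("latin-1"))  # 'hÃ©llowörld' (mojibake)
# mixed.decode("utf-8") raises UnicodeDecodeError instead,
# because 0xF6 followed by 'r' is not valid UTF-8.
```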
I don't know how good the library support for the various Unicode encodings is in Ruby though. According to the article, there still is trouble with correctly doing case transformations and reversing them.
Which brings me to another point: Some of the stuff you do with strings isn't just dependent on string encoding, but also locale.
Uppercasing rules, for example, depend on the locale, so you need to take that into account too. And, of course, deal with cases where you don't know the locale the string came from (encoding is hard enough and in most cases undetectable, but locales are next to impossible).
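For instance, Python's case mappings are locale-independent, so Turkish casing comes out wrong no matter what (Turkish maps i to İ, U+0130, and I to ı, U+0131):

```python
# str.upper()/str.lower() apply the default Unicode case mappings,
# with no way to pass a locale:
print("i".upper())       # 'I'  (Turkish would want 'İ', U+0130)
print("I".lower())       # 'i'  (Turkish would want 'ı', U+0131)

# Full case mapping does work where it is locale-independent:
print("straße".upper())  # 'STRASSE' (German sharp s expands to SS)
```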
I laugh at people who constantly tell me that this isn't hard and that "it's just strings".
What does it get right? It's all broken, like nearly everything else!
It's sad that 99% of the comments there are “oh see, I can run some examples from the page just fine. So everything's all right, I've got full Unicode!”
The reality is there are one or two languages trying to get it right from the beginning (Perl 6, I'm looking at you). It's 2013, and if a language can compose bytes into code points, everyone declares a win, sticks a "full unicode support" label on it and continues to str[5:9].
”But I've got UnicodeUtils!” It won't help. People just don't want to, or cannot, write it correctly. A word is not [a-z]. Not [[:alpha:]] either. And not [insert regex here]. You cannot reverse a string by reversing its code point list. And you cannot reverse it by reversing its grapheme list either. And string length is hard to compute, and then it doesn't make any sense anyway. And indexing into a string doesn't make any sense either, and it's far from O(1).
Between strings being native unicode code points (you have to encode to bytes to get UTF-8) and unicodedata for normalization and decomposition (http://docs.python.org/3.3/library/unicodedata.html) I've found Python 3 pretty robust. Python 3.3 also uses appropriate Unicode data for regular expressions, as mentioned on http://docs.python.org/3.3/howto/regex.html.
If you want to compare strings, you should really normalize them first, which is where unicodedata comes in. In my programming situations it would be wrong to conflate different decompositions of the same Unicode string. Why? Because other software you interact with uses encodings, and the UTF-8 encodings of two different decompositions are different. I've run into this with UTF-8 filenames on OS X when working with Subversion.
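To make that concrete, here's the noël example with unicodedata in Python 3:

```python
import unicodedata

nfc = "no\u00ebl"        # precomposed ë (U+00EB)
nfd = "noe\u0308l"       # 'e' followed by combining diaeresis (U+0308)

print(nfc == nfd)        # False: same text, different code point sequences
print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalizing

# And the UTF-8 encodings differ too, which is what bites you
# with filenames on OS X:
print(nfc.encode("utf-8"))  # b'no\xc3\xabl'
print(nfd.encode("utf-8"))  # b'noe\xcc\x88l'
```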
Python 3.3.2 (default, Nov 27 2013, 20:04:48)
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Edit: python version
Taking the time to actually prove your point is useful. However, your recent example seems to run fine on Python 3.3. You did not include any version info in your example output.
Python 3.3.0 (default, Mar 11 2013, 00:32:12)
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
A little googling turned this up. https://mail.python.org/pipermail/python-ideas/2013-July/021...
x = 'o\u0308o\u0327'
len(x) == 4
x == "öo̧"
x[:2] == "ö"
x[2:] == "o̧"
x[0] == "o"
import regex as re

decomposed_str = 'o\u0308o\u0327'
graphemes = re.findall(r'\X', decomposed_str)  # split into grapheme clusters
sub_graphemes = graphemes[1:]
decomposed_substr = ''.join(sub_graphemes)
[edit: Python 3.2.3]
[edit: [GCC 4.7.2] on linux2]
>>> 'öo̧'[1:] #copy-paste
>>> 'öo̧'[::-1] # "reverse" also breaks
#But for Japanese:
# And Norwegian
# And a few "French" characters (in this case manually typed as alt+~+e, etc.)
# And crucially for your example, typed as
[edit2: I gather that working with "pre-composed" characters works, and working with "de-composed" ones breaks. Which, while expected, is a little sad, I agree.]
One of the biggest things that I feel Python gets right with the string type is that strings are immutable. It makes a lot of things easier.
It really makes sense to have a good string type for small strings, stored in unicode. Immutability makes everything simpler.
The string type is not a good fit for handling large amounts of text. There are trade offs for efficiency that have to be made to create a handy string type. It really makes sense to have a separate "bytes" type or some kind of StringBuffer for doing big text operations.
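In Python that usually means io.StringIO (or just str.join) instead of repeated concatenation; a minimal sketch:

```python
import io

# Repeated += on an immutable str can copy the whole string each time;
# a buffer accumulates pieces and builds the final string once.
buf = io.StringIO()
for word in ["many", "small", "pieces"]:
    buf.write(word)
result = buf.getvalue()
print(result)  # 'manysmallpieces'

# The common Python idiom for the same thing:
assert result == "".join(["many", "small", "pieces"])
```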
one can create mutable strings as well). Lua also uses immutable strings. In Java and C# I think the situation is the same, since if you want high-performance string manipulation, you'll generally resort to some form of StringBuilder helper class.
s = "hello"
s << " world"
s #=> "hello world"

a = "hello"
a = a + " world"
a #=> "hello world"
I'm not sure what "internally using unicode" means. Python's internal representation of strings has changed a lot. It hasn't even been stable in Python 3. Now they are apparently using an internal representation that varies depending on the "widest" character stored.
The only solution that isn't driving me insane is to use UTF-8 everywhere. The Python 3 unicode situation is actually the main reason why I'm not using Python much these days.
If you want to work with strings, you work with strings. If you want to work with bytes, you work with bytes. If you want to convert bytes into strings (maybe because it's user input that you want to work with), then you tell Python what encoding these bytes are in and you have it create a string for you. You don't care what Python uses internally, because their string API is correct and correctly works on characters.
That noël example from the original article consists of 4 characters in Python 3, which is exactly what you want.
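That boundary-decoding workflow in Python 3 looks like this (the byte string here stands in for arbitrary user input):

```python
raw = b"no\xc3\xabl"        # bytes from I/O, known to be UTF-8
s = raw.decode("utf-8")     # tell Python the encoding at the boundary

print(len(s))               # 4: n, o, ë, l
print(s.encode("utf-8"))    # encode again when writing back out
```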
I know that just using UTF-8 everywhere would be cool, but that's not how the world works for various reasons. One is that UTF-8 is a variable length encoding which has some performance issues for some operations (like getting the length of the string. Or finding the n-th character).
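A small illustration of the variable-length issue (Python 3):

```python
s = "日本語abc"

print(len(s))                  # 6 code points
print(len(s.encode("utf-8")))  # 12 bytes: each CJK character takes 3

# In a UTF-8 byte string, finding the nth character means scanning
# from the start, since characters are 1 to 4 bytes wide.
```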
UTF-8 also isn't widely used by current operating systems (Mac OS and Windows use UCS-2). It's also not what's used by way too many legacy systems still around.
So as long as the data you work with likely isn't in UTF-8, the encoding and decoding steps will be needed if you want to be correct. Otherwise, you risk mixing strings in different encodings together which is an irrecoverable error (aside of using heuristics based on the language of the content).
I do need to know and I always care. My requirements may be different than those of most others because I write text analysis code and I need to optimize the hell out of every single step. I shiver at the thought that any representation could be chosen for me automatically.
Of course, nothing is stopping me from simply using the bytes type instead of str, but clearly the Python community has decided to go down a road I feel is entirely wrong so I'm not coming along.
>I know that just using UTF-8 everywhere would be cool, but that's not how the world works for various reasons. One is that UTF-8 is a variable length encoding which has some performance issues for some operations (like getting the length of the string. Or finding the n-th character).
I'm bound to live in a variable length character world unless I decide to use 4 byte code points everywhere, which is prohibitive in terms of memory usage. Memory usage is absolutely critical. Iterating over a few characters now and then to count them is almost always negligible.
The need to index into a string to find the nth character only comes up when I know what I'm looking for. Things like parsing syntax or protocol headers come to mind, and they are always ASCII. I don't remember a situation where I needed to know the nth character of some arbitrary piece of text and repeat that operation in a tight loop.
If one day I find myself in such a situation I will have to convert to an array of code points anyway.
I don't see how this equates to a general purpose language failing at strings, especially when the language isn't particularly focused on performance and optimization. And if memory usage is of concern, I would certainly think anything like Python and Ruby would be out of the running?
And I don't see where I said it did.
I used to favor a dual Python/C++ strategy, but Python's multithreading limitations and the decisions around unicode have convinced me to move on. It's not like anything has gotten worse in Python 3, it's just that there has been a major change and the opportunity to do the right thing was missed.
I happen to think that UTF-8 everywhere is the right way to go, not just for my particular requirements, but for all applications, because it reduces overall complexity.
And I'd like to know what you think the "right thing" would be.
I agree that only using UTF-8 would be the right thing, but only if you don't want an "array of code points"... the problem is: every language, and every developer, expects to be able to have random access to code points in their strings...
There are some weird exceptions, like Haskell's Data.Text (I think that's due to Haskell's laziness).
Would you prefer to have O(n) indexing and slicing of strings, or would you prefer to get rid of these operations altogether?
If the latter, what would you prefer to do? Force developers to use .find() and handle such things manually, or create some compatibility string type restricted to non-composable code points?
Getting an implementation out to see it used in the wild might be an interesting endeavor... it would probably be easier in a language that allows you to customize its reader/parser... like some Lisp... Clojure.
Then we agree entirely. I want all strings to be UTF-8. Period. What I said about an array of code points was that I would create one separately from the string if I ever had a requirement to access individual code point positions repeatedly in a tight loop.
>the problem is: every language, and every developer expect to be able to have random access to codepoints in their strings
>would you prefer to have O(n) indexing and slicing of strings
I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.
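A rough sketch of what that O(n) helper could look like on top of plain UTF-8 byte strings (nth_codepoint is a made-up name for illustration, not a real API; it assumes well-formed UTF-8 input):

```python
def nth_codepoint(data: bytes, n: int) -> str:
    """Scan UTF-8 bytes left to right and return the nth code point, in O(n)."""
    count = 0
    i = 0
    while i < len(data):
        b = data[i]
        # The lead byte encodes the length of the sequence in UTF-8.
        if b >= 0xF0:
            length = 4
        elif b >= 0xE0:
            length = 3
        elif b >= 0xC0:
            length = 2
        else:
            length = 1
        if count == n:
            return data[i:i + length].decode("utf-8")
        count += 1
        i += length
    raise IndexError(n)

print(nth_codepoint("日本語".encode("utf-8"), 1))  # '本'
```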
Actually, you can in Python... and obviously most developers ignore such issues 
My point is that most developers don't know these details, and a lot of idioms are ingrained... getting them to work with string types properly won't be easy (but a good stdlib would obviously help immensely in this regard).
> I would leave indexing/slicing operators in place and make sure everyone knows that it works with bytes not codepoints. In addition to that I would provide an O(n) function to access the nth codepoint as part of the standard library.
Ok, so with your proposal, a hypothetical slicing method on a String class in a Java-like language would have this signature?
byte[] slice(int start, int end);
I've been fancying the idea of writing a custom String type/protocol for Clojure that deals with the shortcomings of Java's strings... I'll probably have a try with your idea as well :)
No, you can only get random access on codepoints which will break text as soon as combining characters are involved. Even if you normalize everything beforehand (which most people don't do) as not all possible combinations have precomposed forms.
Unicode makes random access useless at anything other than destroying text.
> but a good stdlib would obviously help immensely in this regard
Which is extremely rare, and which Python does not have.
You are right (apart from combining characters as masklinn explained), but as I said, that's only possible if an array of 32 bit ints is used to hold string data or if it can be guaranteed that there are no characters from outside ASCII or BMP. If I understand PEP 393 correctly, what Python 3.3 does is to use 32 bit ints to hold the entire string if even one such code point occurs. So if you load a (possibly large) text file into a string and one such code point exists then the file's size is going to quadruple in memory. All of that is done just to implement one very rare operation efficiently.
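The effect is easy to see with sys.getsizeof (the exact byte counts vary by CPython version; the point is the roughly fourfold jump caused by a single astral code point):

```python
import sys

ascii_s = "a" * 1000
wide_s = ascii_s + "\U0001F600"   # append a single non-BMP code point

print(sys.getsizeof(ascii_s))  # roughly 1 byte per character plus overhead
print(sys.getsizeof(wide_s))   # roughly 4 bytes per character plus overhead
```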
Which parts of Mac OS? You'd have a lot of problems with Emoji support if that were true. To the best of my knowledge, it's UTF-16 everywhere.
Or do you actually mean Mac OS as in Mac OS 9, and not OS X?
"because their string API is correct"
Apparently they have a bug in their UTF-7 parser that can lead to invalid unicode strings. Don't know if it's already fixed.
And it was fixed more than a month ago, just 2 days after it was reported.
Let's avoid spreading FUD, shall we? :)