That said, "convert everything to your native Unicode format at the edges and reconvert it back out at the edges" is at least a tolerable answer. You still lose things, but it puts you ahead of most programs. But few environments make even that really easy, because it turns out to be difficult to identify all the edges; sure, your web framework may accept and emit Unicode (and then again it may not...), but did you read files off your disk in the correct encoding? Does your database handle encoding correctly? Does all the other code that ever inputs or outputs anything handle Unicode correctly? Do you ever store something in a system that is really just for storing binary blobs, and forget about the encoding?
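To make the "edges" concrete, here is a minimal Ruby 1.9+ sketch of that pattern; the file names and the Shift_JIS source encoding are assumptions for illustration, not anything from the article:

    # Decode at the input edge: transcode from the file's real encoding to UTF-8.
    text = File.read("legacy.txt", encoding: "Shift_JIS:UTF-8")
    text.encoding   # => #<Encoding:UTF-8> -- everything internal stays UTF-8

    # ... all processing happens on UTF-8 strings ...

    # Re-encode at the output edge: the IO layer transcodes back to Shift_JIS.
    File.open("out.txt", "w:Shift_JIS") { |f| f.write(text) }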
It's hard, it's tedious, and from what I've seen it's even harder and more tedious than it has to be, because so little of the system is usually built to make it work right; the people creating all your libraries were either ignorant of the issues or perhaps even contemptuous of them.
I have often thought about what change I would make in 1970, if I could, to fix a lot of modern code. Eliminating the null-delimited buffer is definitely number one, but explaining that there is no such thing as a "string" without an encoding label would be number two. Anywhere I see a "string" in the input or output specification for a function, I just cringe.
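As a quick illustration of why a "string" means nothing without an encoding label (sketched in Ruby 1.9+ syntax, since that's the language at hand): the same two bytes read as completely different text under different labels.

    bytes = "\xC3\xA9".dup.force_encoding("BINARY")          # two raw bytes, no real label

    bytes.dup.force_encoding("UTF-8")                        # => "é"  (one character)
    bytes.dup.force_encoding("ISO-8859-1").encode("UTF-8")   # => "Ã©" (two characters)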
> That said, "convert everything to your native Unicode format at the edges and reconvert it back out at the edges" is at least a tolerable answer.
It obviously isn't for the Ruby developers. If it were, they would have chosen UTF-8 as the internal encoding, which they didn't, because they didn't consider this a tolerable answer. Even though you can get Ruby 1.9 to work this way, the approach can still cause some headaches.
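For reference, "getting Ruby 1.9 to work this way" presumably means something along these lines; this is a sketch of one possible setup, not the shipped defaults:

    # Treat unlabeled input as UTF-8 and transcode everything read via IO to UTF-8.
    Encoding.default_external = Encoding::UTF_8
    Encoding.default_internal = Encoding::UTF_8

    # Note: string literals still take their encoding from the source file's
    # magic comment, not from default_internal.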
I was addressing the complaint that the encoding in Ruby is hard now and that it broke working code. Encoding is fundamentally hard, and if encoding used to be easy, it is almost certainly because your old code got it wrong, and your old code probably wasn't working. I emphasize the "probably" because it is faintly possible that your old code really did work and now it really doesn't, in which case I would understand the frustration. But if I were giving odds on the chance that the old code actually handled everything correctly, I'd open the bidding at somewhere around 5:1 for a superstar encoding expert (working in a language with poor encoding-labelling support), with the odds getting worse the further from that you get. There are some things that are just hard without language support, even for experts.
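A typical example of the kind of bug the old byte-oriented code hid (sketched in Ruby 1.9+ so the damage is visible; in 1.8, byte-wise slicing like this simply happened silently):

    s = "héllo"                                   # "é" is two bytes in UTF-8
    bytes = s.dup.force_encoding("BINARY")

    truncated = bytes[0, 2]                       # byte-wise truncation, 1.8 style
    truncated.force_encoding("UTF-8").valid_encoding?   # => false -- quietly corrupted text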
This, to me, is the fundamental flaw. By now we should be able to have a single encoding that takes up as little space as possible while supporting every known character and leaving room for more. Most machines now are at least 32-bit... that's almost 4.3 billion possible characters. Surely there are fewer than that in the world?
One of my "Grand Lifetime Projects" is to build a new programming language and a new OS built with it. Part of that will certainly be handling strings in an efficient way, both in terms of computer and programmer time. I have some ideas swirling around for creating One True Encoding that allows for extensibility.
That's pretty much UTF-8. If you're going to stuff everything into one encoding, you are going to have to make tradeoffs.
See also UCS-4, which is simply "throw 32 bits at every character". Nobody uses it because it makes everything pretty big. (At least a 3-byte CJK character tends to mean more on average than a single English character.)
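A rough size comparison (in Ruby, assuming a UTF-8 source encoding) shows the tradeoff:

    ["a", "é", "漢"].each do |ch|
      utf8  = ch.bytesize                      # 1, 2, and 3 bytes respectively
      utf32 = ch.encode("UTF-32BE").bytesize   # always 4 bytes per character
      puts "#{ch.inspect}: UTF-8=#{utf8} bytes, UTF-32=#{utf32} bytes"
    end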
If you haven't already, take some time to read over the Unicode standards. It is very enlightening. This is especially true if you want to make the "one true encoding"; gotta know what the bar is to beat, right? There's way more to Unicode than just "Here's a catalog of all possible characters and here's numbers for them all", and there are reasons why there's way more than that.
I am from Korea, and one of the treasures of this country is the Tripitaka Koreana, a compilation of Buddhist texts carved in the 13th century. It is 52,382,960 characters long, Wikipedia tells me. There is a whole institute devoted to this document. The institute started encoding it in machine-readable form in 1993 and completed the first draft in 2000. In the process they discovered 23,385 new letterforms not registered anywhere. There are many such encoding projects yet to be completed. So yeah, Unicode won't cover everything. That is a given. And that's okay.
It is also a "public good" issue: the sooner everyone up-converts to 1.9, the easier it will be to develop with Ruby, because all the required gems will work, etc. I have whined quite a bit about up-conversion hassles on rubyplanet.net, so I do understand the author's pain + complaints, but we do all need to move forward.
With respect to the performance boost: startup time didn't improve, and IIRC the same is true for certain string operations. Both are important in the field where Ruby originated and where I personally still find it most useful: scripting, or rather serving as a Perl replacement.
God speaks latin1, too.
I found this series of articles well worth the time to read and understand:
As I said, it seems to kind of work, but I would like to be sure of what is going on.
And BTW, there is a loss here, since not all character sets are fully covered even in UTF-16 (IIRC). I am trying to recall; maybe Matz gave a detailed reply somewhere on the net.
If they aren't covered in UTF-16, they wouldn't be covered in UTF-8 or UCS-4 either. All modern Unicode encodings (i.e. not UCS-2) can encode exactly the same data.
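A quick way to see this in Ruby: text round-trips losslessly between the UTF encodings, including characters outside the BMP.

    s = "ASCII, déjà vu, 漢字, 𝄞"    # the last character is outside the BMP
    %w[UTF-16BE UTF-32BE].each do |enc|
      puts s.encode(enc).encode("UTF-8") == s   # => true for both
    end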
I'd be curious to know which character sets Unicode doesn't cover yet a different encoding system does.