

A rant about Ruby 1.9 String encoding - dboyd
http://github.com/candlerb/string19/raw/47b0cba0a2047eca0612b4e24a540f011cf2cac3/soapbox.rb

======
pilif
Rants about strings and character sets that contain words of the following
spirit are usually neither correct nor worthy of any further thought:

    
    
      > It's a +String+ for crying out loud!  What other
      > language requires you to understand this
      > level of complexity just to work with strings?!
    

Clearly the author lives in an ivory tower of English-language environments
where he is able to say that he "switched to UTF-8" without actually having
done so, because the parts of UTF-8 he uses work exactly the same as the
ASCII he used before.

But the rest of the world works differently.

Data can arrive in all kinds of encodings and can be required in other kinds
of encodings. Some of those can be converted into each other; some Japanese
encodings (Ruby's creator is Japanese), for example, can't be losslessly
converted to a Unicode representation.

Also, I often see the misunderstanding that "Unicode" is a string
encoding. It's not. UTF-8 or UTF-16 is. Or UCS-2 (though that one is basically
broken because it can't represent all of Unicode).

Nowadays, as a programming language, you have three options for handling
strings:

1) pretend they are bytes.

This is what older languages have done and what Ruby 1.8 does. This of course
means that your application has to keep track of encodings: for every string
you keep in your application, you also need to track what it is encoded in.
When concatenating a string in encoding A to a string you already have in
encoding B, you must do the conversion manually.

Additionally, because strings are bytes and the programming language doesn't
care about encoding, you basically can't use any of the built-in string
handling routines, because they assume each byte represents one character.
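To act out what that bookkeeping looks like, here is a sketch in modern Ruby
syntax (Ruby 1.8 itself had no `force_encoding`; the strings and encodings are
just illustrative). Every string travels with a separate, hand-maintained note
about its encoding, and you convert manually before combining anything:

```ruby
# Method 1 as a workflow: raw bytes plus a hand-tracked encoding label.
latin1_bytes = "caf\xE9".b       # "café" as raw ISO-8859-1 bytes
latin1_enc   = "ISO-8859-1"      # ...the label, kept next to the data by hand

utf8_bytes = "caf\xC3\xA9".b     # "café" as raw UTF-8 bytes
utf8_enc   = "UTF-8"

# Before concatenating, you must do the conversion yourself:
converted = latin1_bytes.dup.force_encoding(latin1_enc).encode(utf8_enc)
combined  = utf8_bytes.dup.force_encoding(utf8_enc) + converted
# Forget that conversion step and you silently get mojibake instead.
```
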

Of course, if you are one of these lucky English UTF-8 users, getting data in
ASCII and English text in UTF-8, you can easily "switch" your application to
UTF-8 while still pretending strings are bytes because, well, they are. For
all intents and purposes, your UTF-8 is just ASCII called UTF-8.

This is what the author of the linked post wanted.

2) use an internal unicode representation

This is what Python 3 does, and it's what I feel is a very elegant solution if
it works for you: a string is just a collection of Unicode code points.
Strings don't worry about encoding. String operations don't worry about it.
Only I/O worries about encoding. So whenever you get data from the outside,
you need to know what encoding it is in, and then you decode it to convert it
to a string. Conversely, whenever you want to actually output one of these
strings, you need to know in what encoding you need the data, and then you
encode that sequence of Unicode code points into the target encoding.

You will never be able to convert a bunch of bytes into a string or vice versa
without going through some explicit encoding/decoding.
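The contract can be sketched in Ruby terms (the Latin-1 input is just an
example): bytes come in, get decoded exactly once at the boundary, and get
encoded again only on the way out.

```ruby
# Method 2 as a boundary discipline: decode on input, encode on output.
incoming = "gr\xFC\xDFen".b   # raw bytes from the outside ("grüßen", Latin-1)

# "Decode": declare what the bytes are, convert to the internal representation.
text = incoming.dup.force_encoding("ISO-8859-1").encode("UTF-8")

# ... all string operations happen on `text`, never on raw bytes ...

# "Encode": convert back to whatever the output side requires.
outgoing = text.encode("ISO-8859-1")
```
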

This of course has some overhead associated with it, as you always have to do
the encoding and decoding, and because operations on that internal collection
of Unicode code points might be slower than the simple array-of-bytes
approach.

And whenever you receive data in an encoding that cannot be represented with
Unicode code points, or need to send out data in such an encoding, you are
screwed.

This is a deficiency in the Unicode standard. Unicode was specifically
designed so that it could represent every encoding, but it turns out that it
can't correctly represent some Japanese encodings.

3) Store an encoding with each string and expose both the string's contents
and its encoding

This is what Ruby 1.9 does. It combines methods 1 and 2: it allows you to
choose whatever internal encoding you need, it allows you to convert from one
encoding to another, and it removes the need to externally keep track of every
string's encoding.

You can still use the language's string library functions, because they are
aware of the encoding and usually do the right thing (minus, of course, bugs).
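A small sketch of what that looks like in Ruby 1.9+ (the Japanese greeting is
just an example string): every `String` carries its encoding, and the built-in
methods consult it, so the character-based and byte-based views stay distinct.

```ruby
# Encoding-aware built-ins: character operations vs. byte counts.
s = "こんにちは"

s.encoding.name  # the string knows its own encoding ("UTF-8")
s.length         # counts characters: 5
s.bytesize       # counts bytes: 15 (3 bytes per character in UTF-8)
s.reverse        # reverses per character, not per byte
```
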

As this method is independent of the (broken?) Unicode standard, you never get
into the situation where just reading data in some encoding makes you unable
to write the same data back in the same encoding: you just create a string
using this problematic encoding and do your stuff on that.

Nothing prevents the author of the linked post from using Ruby 1.9's
facilities to do exactly what Python 3 does (again, ignoring the Unicode
issue) by internally keeping all strings in, say, UTF-16: transcode all
incoming and outgoing data to and from that encoding, and do all string
operations on that application-internal representation.
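Ruby 1.9 even ships a knob for roughly this setup: `Encoding.default_internal`
makes subsequently opened I/O objects transcode to one internal encoding. A
sketch (the file name is made up for illustration, and I use UTF-8 rather than
UTF-16 since that's what most Ruby code settles on):

```ruby
require 'tmpdir'

# Pick one internal encoding; IO opened after this will transcode into it.
Encoding.default_internal = Encoding::UTF_8

# Fake some external data: a Latin-1 file on disk containing "café".
path = File.join(Dir.tmpdir, "latin1_example.txt")
File.binwrite(path, "caf\xE9".b)

# Declaring the *external* encoding is enough; read() transcodes on the way in.
text = File.open(path, "r:ISO-8859-1") { |f| f.read }
text.encoding  # already UTF-8, no manual conversion in application code
```
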

A language throwing an exception when you concatenate a Latin-1 string with a
UTF-8 string is _a good thing_! You see: once that concatenation has happened
by accident, it's really hard to detect and fix.

At least it's fixable, though, because not every Latin-1 string is also a
valid UTF-8 string. But if it so happens that you concatenate, say, Latin-1
and Latin-8 by accident, then you are really screwed, and there's no way to
find out where the Latin-1 ends and the Latin-8 begins.

In today's small world, you _want_ that exception to be thrown.
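Here is that exception in action in Ruby 1.9+ (the word "café" is just an
example): the mistake blows up at the point where it happens, instead of
producing mojibake that surfaces much later.

```ruby
# Concatenating incompatible encodings raises immediately.
utf8   = "caf\u00E9"                                 # "café" in UTF-8
latin1 = "caf\xE9".dup.force_encoding("ISO-8859-1")  # same word, Latin-1 bytes

error = begin
  utf8 + latin1  # neither side is ASCII-only, so this cannot be reconciled
  nil
rescue Encoding::CompatibilityError => e
  e              # raised right here, at the site of the bug
end
```
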

Conclusion

What I find really amazing about this complicated problem of character
encoding is the fact that nobody feels it's complicated, because it usually
just works - especially method 1 described above, which has constantly been
used in years past and is also very convenient to work with.

Also, it still works.

Until your application leaves your country and gets used in countries where
people don't speak ASCII (or Latin1). Then all these interesting problems
arise.

Until then, you are annoyed by every method I described but method 1.

Then, you will understand what great service Python 3 has done for you and
you'll switch to Python 3 which has very clear rules and seems to work for
you.

And then you'll have to deal with the Japanese encoding problem, and you'll
have to use binary bytes all over the place and stop using strings altogether,
because just reading input data destroys it.

And then you might finally see the light and begin to care for the seemingly
complicated method 3.

Sorry for the novel, but character encodings are a pet-peeve of mine.

~~~
ziggurism
I liked your response a lot. Thank you. And I was almost persuaded by it. In
fact I was persuaded for about 5 minutes after reading it. But at the last
second, a thought occurred to me: if there's a deficiency in Unicode that
prevents its use for Japanese, isn't the right solution to just fix that
deficiency?

I mean, Unicode is meant to be an abstract representation of glyphs, separate
from any encoding, that works for all of Earth's languages. It's tailor made
to be a programming language's internal representation of a string. This is
its raison d'etre.

So it seems to me that #2 is definitely The Right Way™ and that if there's
some problem with Unicode that has kept Ruby from adopting it, they should
have worked on fixing it, rather than breaking Ruby. OK, "break" is probably
too strong a word for the state of Ruby 1.9. And in the real world, fixing an
international politicized standard like Unicode is probably impossible. So I
can see that this pragmatic solution might have been the only one available.
But still, it seems wrong to me.

Out of curiosity, what exactly is the deficiency in Unicode that caused Matz
to go with option 3? I presume there are epic flamewars all over the internet
about this issue, but I just haven't been paying close enough attention.

------
halostatue
I know nothing about the author, but there are some statements made that
suggest that the author hasn't had to deal with the wild-and-woolly reality of
encodings out there in a lot of extant data. One only _wishes_ that all data
were UTF-8.

What Ruby 1.9 gets absolutely right is that its String implementation is
completely encoding agnostic (by which I _specifically_ mean that it doesn't
force your data to be encoded in a particular way). There are encodings for
which there is no safe UTF-8 round trip (you can successfully convert the data
to UTF-8, but when you convert back from UTF-8 to that encoding, you won't get
the original input back; you'll get a slightly different output).

Rubyists in Japan don't have the luxury of dealing with Unicode all the time;
they still get lots of data in ShiftJIS and other encodings. (The same is true
of Rubyists elsewhere, but since US-ASCII is a proper subset of UTF-8, most
folks don't know the difference; Win1252 is a pain in the ass, though.) If you
have to do ANY work with older data formats, you curse languages that force
you to use UTF-8 all the time instead of letting you work with the native
data.

Most developers don't think about i18n nearly enough in any case; there's a
lot more to worry about that simply using Unicode doesn't solve for you. Even
the developers of Ruby have to worry about the fact that LATIN SMALL LETTER E
WITH ACUTE (U+00E9) is canonically equivalent to LATIN SMALL LETTER E (U+0065)
followed by COMBINING ACUTE ACCENT (U+0301); and that doesn't begin to address
the capitalization of 'ß' ('SS', which isn't necessarily reversible) or the
fact that in Turkish 'ı' capitalizes to 'I', but 'i' capitalizes to 'İ'. Don't
EVEN get me started on number formatting...

EDIT: Added the last paragraph.

~~~
pvg
The roundtrip thing is an edge-case that doesn't really justify inflicting the
non-deterministic pain on everyone. Python 3 and Java have taken the 'one true
internal encoding' path and while hardly free of warts, it's an approach that
is practically saner. The alternative is making some people's hell everyone's
hell, forever.

~~~
halostatue
"Hardly free of warts" doesn't even begin to cover the pain that's dealt with
if you have to deal with these external encodings.

And, if you've got loads of data in an encoding that doesn't roundtrip, it's
hardly an edge case.

Ruby's implementation is supposed to be such that if you _want_ UTF-8 support
and know that your (text) inputs and outputs are always going to be UTF-8, you
never have to think anything differently than you did in Ruby 1.8. If it isn't
working that way, then I think there's a bug.

------
tenderlove
Dealing with string encoding is sometimes a PITA. But I think as English
speakers, we are usually sheltered from the problem because most programming
languages are English centric. I'd like to hear opinions from people who don't
speak English.

If you think the way 1.8 handles (or doesn't handle) encoding is just fine,
try things you typically do, but with a different language.

For example:

    
    
      ruby -ryaml -e'p YAML.dump("こんにちは！")'
    
    

You might also try things like inserting into a database, parsing documents,
etc.

IMO, not having an encoding associated with some text sucks if you're a
non-English speaker.

~~~
papachito
> I'd like to hear opinions from people who don't speak English.

I'm sure people who don't speak English will read that and answer you straight
away ;)

------
pkulak
I agree. I'd say a good 5% of development time on a new Ruby 1.9 project of
mine has been spent dealing with strings. I've taken to the idea that as long
as _everything_ is UTF-8, then I'll be okay, but good luck enforcing that!
Especially since the whole world seems to default to ASCII, while actually
_using_ multi-byte chars anyway. I think I've got it now, but no, not really.
It just works on my development machine. On my server, where it actually
counts, I'm getting garbage where I should be getting an accent. I've spent
two days now just trying to figure out how to debug something like that!

------
harpastum
This "rant" is just a part of an effort by the author to catalog the behavior
of strings in ruby 1.9. You can see the whole project here:
<http://github.com/candlerb/string19>

The main (runnable) documentation file is here:
<http://github.com/candlerb/string19/blob/master/string19.rb>

------
alextgordon
It seems the most obvious solution is to store strings in a standard encoding
(say UTF-8) and to always convert strings to it at the time of their creation.
Is there a technical reason why Ruby doesn't do this?

~~~
halostatue
The technical reason is that it's a stupid idea. Not all encodings can be
safely round-tripped through UTF-8, which means you can end up losing some
data. (Consider
<http://homepage1.nifty.com/nomenclator/perl/ShiftJIS-CP932-MapUTF.html> as a
quick example: "Actually, 7915 characters in CP-932 must be mapped to 7517
characters in Unicode. There are 398 non-round-trip mappings.")

Loss of text data is bad.
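That's also why Ruby's converter makes lossiness explicit rather than silent;
a sketch (the snowman string is just an example of a character missing from
the target encoding):

```ruby
# A character with no codepoint in the target encoding raises by default...
error = begin
  "snowman \u2603".encode("ISO-8859-1")  # U+2603 has no Latin-1 equivalent
  nil
rescue Encoding::UndefinedConversionError => e
  e
end

# ...and lossy conversion is available only as an explicit opt-in.
lossy = "snowman \u2603".encode("ISO-8859-1", undef: :replace)
```
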

~~~
3pt14159
General question: Why didn't the UTF-8 boys and girls make it safe in the
first place? This doesn't sound like rocket science. "This character maps to
that character, this character to that one." I don't understand how we have
unicode snowmen, but we can't safely round trip characters.

~~~
halostatue
Some of it has to do with Han Unification
(<http://en.wikipedia.org/wiki/Han_unification>).

Mostly, though, it's because some of these characters are overloaded. If
you've got a Windows system, go into the DOS window and type "chcp 932" (you
may need the Japanese language files installed). When you type '\', you'll get
'¥' (making "C:\Program Files\" look like "C:¥Program Files¥").

In the systems where what became CP932 was first used, the backslash wasn't
necessary in Japanese, so that code point was used to encode the yen symbol.
Other systems kept the backslash and encoded the yen sign at a different
point. When JIS unified the existing Japanese code pages, it couldn't very
well go back in time to change all that old data, so it merged the two
encodings in many places. So there's only one Unicode codepoint for the yen
glyph ¥, but in this one encoding there are two different characters for it.

This is the most blatant example of a problem with Unicode transcoding, but as
far as I know, it's not the only one.

See <http://www.mail-archive.com/linux-utf8@nl.linux.org/msg02337.html> for
what could be done, but probably won't be.

------
mtarnovan
Having given up porting a medium-sized Rails application to Ruby 1.9 just last
week, I thought I'd share some of my experience. Ruby 1.9's String
implementation is fine, but it will require lots of changes to existing code.
Rails 2.3.5 is definitely _not_ ready for Ruby 1.9 and UTF-8. For example,
some regexps in helpers are not encoding aware, Rack 1.0.1 (which is required
by 2.3.5) doesn't play nicely with encodings, and there are many more small
and annoying problems (template encoding, for example: the app I was porting
mostly uses HAML, which supports Ruby 1.9's string encodings in the latest
versions, but no luck for the few ERB templates).

All in all, this is a huge transition which will take a while to propagate
through the whole Rails stack.

------
vegai
My Sup (on Ruby 1.9) just _crashed_ on me because I dared to try to write a
mail with UTF-8 in it.

This is not a trivial screwup! This is the sort of screwup that should make
everybody who's using that wretched platform think thrice before continuing to
use it.

------
smallpaul
I believe that Python, Java, C#, Objective-C and Javascript _all_ have the
same basic approach to this problem. The Ruby way is better for handling some
Japan-specific problems. But that's at the cost of making life harder and less
predictable for everyone else.

It's a pretty straightforward tradeoff. Of course people who are not Japanese
will naturally be upset to pay a cost in complexity for a feature of benefit
primarily to programmers from a single country. Non-Japanese Ruby programmers
will just have to decide whether their solidarity with Japanese programmers
outweighs their personal and collective inconvenience.

------
AdamN
It seems like this would be a problem on Python as well, no?

~~~
halostatue
Python "cheats" by converting everything to Unicode internally. It seems like
a simple solution, but it's not a solution since not everything can be
converted safely _back_ to the original encoding.

~~~
ominous_prime
Python 3 doesn't use an internal encoding, it uses Unicode. Python 2 can use
Unicode strings if you specify them. I'm not familiar enough with Ruby to
compare the two, but I don't see a problem with how Python handles it. You
sometimes need to know about your string encoding, which is just a fact of
modern programming.

[EDIT] I think I missed my point slightly. Python 3 doesn't _change_ the
encoding of strings, it decodes them to unicode. You can encode the string
back to the original encoding without loss.

~~~
halostatue
If a CP932 '\' is interpreted as '¥', but is exported as CP932 '¥', there's
data loss. Unless Python 3 keeps the original data around when it converts to
Unicode (probably UTF-8 or UTF-16), there _will_ be data loss in those cases.
It's unavoidable.

~~~
ominous_prime
UTF-\d is not unicode. Unicode isn't an encoding, it's the decoded
representation of a character encoding.

    
    
[edit] clarified example

        >>> s = u'\xa5' # shiftjis decoding of \
        >>> print s.encode('shiftjis')
        \
        >>> print s.encode('utf-8')
        ¥

------
kule
He has posted some alternatives too

<http://github.com/candlerb/string19/blob/master/alternatives.markdown>

The first suggestion seems like the logical solution to me however I don't
need to deal with this stuff on a day-to-day basis...

------
old-gregg
Extrapolating, the same rant applies to all duck-typed languages: the number
of possible outcomes for a=b+c explodes depending on the types and contents of
a, b, and c, so his first assumption about a one-dimensional space is
incorrect.

This is why most C++ teams prohibit their members from overloading operators.

~~~
zaphar
Only if your static types for strings include a separate type per encoding.
Most languages, including the duck-typed ones, just use a single encoding
internally, "utf-8" for example. So the String type is always compatible. For
Ruby, though, it sounds like the string type could be any encoding. That, to
me, sounds dangerous. It's not caused by the language being duck-typed. It's
caused by the language having the same type with multiple behaviours. Ruby
made strings into a minefield for 99% of programmers and safer for 10%.

Not sure that's a good ratio.

------
rue
AOL. 1.9's encoding support feels thoroughly wrong to me.

------
vegai
"even Ruby's creators, who are extremely bright people"

There's a hidden "it turns out that" in this sentence.

