

Ruby 1.9 Encodings: A Primer and the Solution for Rails - knowtheory
http://yehudakatz.com/2010/05/05/ruby-1-9-encodings-a-primer-and-the-solution-for-rails/

======
rubyrescue
_Take note of this article; don't think it doesn't apply to you just because
you don't have an internationalized Rails app._

I totally respect the need in the Ruby community to support different
charsets, but I wish Ruby 1.9.1 had a mode where we could say
DEFAULT_CHARSET_IN_CASE_OF_INCOMPATIBLE_ENCODING = "utf-8" and have it
convert to that encoding when in doubt. That would totally solve our issues.
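For what it's worth, Ruby 1.9 does ship something close to this wish for IO:
`Encoding.default_internal` transcodes data read through IO objects into one
chosen encoding. It doesn't help with strings that arrive by other routes,
though, which is where the incompatible-encoding error still bites. A minimal
sketch:

```ruby
# Ruby 1.9+: transcode everything read through IO into UTF-8.
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

# Strings built elsewhere can still collide: concatenating two
# non-ASCII strings with different encodings raises
# Encoding::CompatibilityError.
utf8   = "caf\u00E9"                              # UTF-8
latin1 = "caf\xE9".force_encoding("ISO-8859-1")   # Latin-1
begin
  utf8 + latin1
rescue Encoding::CompatibilityError => e
  puts e.class  # Encoding::CompatibilityError
end
```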

details:

We just moved a fairly large Rails 2.3.4 app to Ruby 1.9.1 this week and have
run into a number of 'incompatible character encodings' errors.

My client is running a site that sends middle-American students to
middle-American colleges. There's not a foreign character or thought in sight
(which is a separate issue). However, our content writers occasionally use MS
Word, and a "smart quote" slips into the uploaded content.

When that happens, our Rails HAML layout is in one encoding, the partial that
renders the smart-quote-bombed text from the database is in another, and the
page errors.

The solution is to write a custom sanitize() method on String that forces
encodings and gsubs away smart quotes. The consequences of missing one are
just higher in 1.9.1 than in 1.8.7.
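The comment doesn't show the method, so here is a hypothetical sketch of that
kind of sanitize helper. The name `sanitize_encoding` and the replacement
table are assumptions for illustration, not the commenter's actual code:

```ruby
# Hypothetical sketch of the approach described above: force a known
# encoding, drop bytes that don't survive transcoding, and gsub away
# Word's "smart" punctuation.
class String
  SMART_PUNCTUATION = {
    "\u201C" => '"', "\u201D" => '"',  # curly double quotes
    "\u2018" => "'", "\u2019" => "'",  # curly single quotes
    "\u2013" => "-", "\u2014" => "--"  # en/em dashes
  }.freeze

  def sanitize_encoding
    # Round-trip through UTF-16 to strip invalid/unmappable bytes
    # (String#scrub did not exist yet in 1.9).
    s = dup.force_encoding("UTF-8")
    s = s.encode("UTF-16BE", invalid: :replace, undef: :replace, replace: "")
         .encode("UTF-8")
    SMART_PUNCTUATION.each { |smart, plain| s = s.gsub(smart, plain) }
    s
  end
end

puts "\u201Csmart\u201D \u2014 quotes".sanitize_encoding  # "smart" -- quotes
```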

------
albertzeyer
The article says:

> However, this solution does not work very well for the Japanese community.
> For a variety of complicated reasons, Japanese encoding, such as SHIFT-JIS,
> are not considered to losslessly encode into UTF-8. As a result, Ruby has a
> policy of not attempting to simply encode any inbound String into UTF-8.

> This decision is debatable, but the fact is that if Ruby transparently
> transcoded all content into UTF-8, a large portion of the Ruby community
> would see invisible lossy changes to their content. That part of the
> community is willing to put up with incompatible encoding exceptions because
> properly handling the encodings they regularly deal with is a somewhat
> manual process.

Does anyone have any sources on this?

Especially: why is SHIFT-JIS important? Is Unicode not capable of encoding the
full Japanese language? What information would you lose by encoding with
Unicode?

And why can't Unicode be extended to encode this missing information?

This seems like a much better solution than this encoding nightmare.
Especially in a language like Ruby, I want things simple and straightforward.

~~~
guns
> Does anyone have any sources on this?

Try this for example: <http://tclab.kaist.ac.kr/~otfried/Mule/unihan.html>

> Especially: why is SHIFT-JIS important? Is Unicode not capable of encoding
> the full Japanese language? What information would you lose by encoding with
> Unicode?

The problem is that SHIFT-JIS, and other regional East Asian encodings, codify
variations of characters that Unicode combines into one. The Japanese in
particular dislike this conflation of characters that they feel should be
separate. This process of combining characters is called Han unification:

<http://en.wikipedia.org/wiki/Han_unification>

Now imagine that you're a Japanese user who feels that three similar ideograms
are in fact semantically different, but a standards body has decided that the
differences are merely stylistic and has combined them into one form: I'm sure
you wouldn't be so willing to lay your language at the feet of Unicode and
swear allegiance to the one true encoding.

Also consider that UTF-8 is optimized for efficient encoding of Western
languages: a Japanese text may in practice be four to eight times the size of
a semantically similar version in a Western language. SHIFT-JIS presumably
does not suffer from this problem.

> This seems like a much better solution than this encoding nightmare.
> Especially in a language like Ruby, I want things simple and
> straightforward.

Except that languages are not really that simple or straightforward. I
appreciate the flexibility of Ruby 1.9's encoding system. The only things
really broken right now are the tools that wycats mentions in this post.

~~~
albertzeyer
Thank you for the information and the links.

But it still seems to me that extending Unicode to also include those
variations of characters is a better and cleaner solution than these
workarounds.

In your link, the author also mentions that some fonts are incomplete. The
right solution to that would be to fix the fonts, not to switch to another
encoding with its own fonts.

Having one encoding (Unicode) that is able to encode just everything would
simplify everything.

Btw, you're speaking about space efficiency. Afaik, most Asian characters can
be encoded as 3 bytes in UTF-8. You cannot get much better than that. And the
space required for text is in most cases much smaller than for other media.
Also, if there is much redundancy in it, it can easily be compressed (even
transparently if needed, e.g. all Ruby strings larger than 32 KB could be
automatically compressed internally, or something like that).

~~~
guns
> Afaik, most Asian characters can be encoded as 3 bytes in UTF-8

I overstated my case for sure. The unified ideograms are three bytes wide in
UTF-8, but the non-unified extensions are larger (four bytes), iirc. And while
disk space is hardly a problem anymore, you might see why programmers from 10
years ago may have made different choices.
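The widths under discussion are easy to check directly: in UTF-8, a BMP
unified ideograph takes three bytes and an Extension B ideograph four, versus
two bytes for the BMP character in Shift_JIS. The sample characters below are
just illustrative picks:

```ruby
kanji = "\u6F22"  # a CJK Unified Ideograph in the BMP
puts kanji.bytesize                       # 3 (UTF-8)
puts kanji.encode("Shift_JIS").bytesize   # 2

ext_b = "\u{20000}"  # a CJK Extension B ideograph, outside the BMP
puts ext_b.bytesize                       # 4 (UTF-8)
```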

> But it still seems to me that extending Unicode to also include those
> variatons of characters is a better and more clean solution than these
> workarounds.

I certainly can't argue with that. But Unicode is a standard, and real
problems don't have time to wait around for standards bodies. In the eyes of
many East Asian organizations, Unicode is broken now, and so the burden falls
on the programmer.

Even here in the US, there are tons of data sitting around in tables encoded
in Windows-1252 and ISO-8859-1. Having had to deal with UTF-8 and Latin-1
mismatches in the past, I don't find Ruby 1.9's encoding model all that
onerous myself.
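That kind of legacy-table cleanup is short in 1.9: tag the bytes with the
encoding they actually are, then transcode. A small sketch, assuming the raw
bytes really are Windows-1252 (0x93/0x94 are its curly double quotes):

```ruby
# Bytes pulled from a legacy table, actually encoded as Windows-1252.
raw = "smart \x93quote\x94".force_encoding("Windows-1252")
utf = raw.encode("UTF-8")
puts utf           # smart “quote”
puts utf.encoding  # UTF-8
```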

