
Ruby 1.9 made me remember how I hate the concept of encodings - fleebie
http://www.rubyfleebie.com/ruby-1-9-made-me-remember-how-i-hate-the-concept-of-encodings/
======
jerf
Encodings are fundamentally hard in our current code environment. If your
language doesn't make you explicitly think about encodings, you are writing
bad code. Period. Full stop. If your language does make you think about
encodings and you just make it go away with compiler incantations or just
bumbling about until the problems seem to go away, sort of, as long as you
don't poke them too hard, you are writing bad code. Period. Full stop. If your
language has no support at all for encodings, may God have mercy on your soul.

That said, "convert everything to your native Unicode format at the edges and
reconvert it back out at the edges" is at least a tolerable answer. You still
lose things, but it puts you ahead of most programs. But few environments make
even that really _easy_ , because it turns out to be difficult to identify
_all_ the edges; sure, your web framework may emit and send unicode (and then
again it may not...), but did you read files off your disk in the correct
encoding? Does your database correctly handle encoding? Does all the other
code that ever inputs or outputs anything handle Unicode correctly? Do you
ever store something in a system that is really just for storing binary blobs,
and forget about the encoding?
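The edge-conversion idea can be sketched in a few lines of Ruby; the file name and the Latin-1 encoding here are assumptions for illustration, not a prescription:

```ruby
require "tempfile"

# Simulate a legacy Latin-1 file at one edge of the system.
tmp = Tempfile.new("legacy")
File.binwrite(tmp.path, "caf\xE9".b)     # "café" as ISO-8859-1 bytes

# Edge in: read with the encoding the bytes actually use, then
# transcode to UTF-8 once.
text = File.read(tmp.path, encoding: "ISO-8859-1").encode("UTF-8")

# Everything inside the program now works on UTF-8 strings.
# Edge out: transcode back when handing data to a Latin-1 consumer.
File.binwrite(tmp.path, text.encode("ISO-8859-1"))
```

The hard part, as noted above, is remembering to do this at _every_ edge, not just the obvious ones.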

It's hard, it's tedious, and from what I've seen it's even harder and more
tedious than it has to be because so little of the system is usually built to
make it work right, because the people creating all your libraries were either
ignorant or perhaps even contemptuous of the issues.

I have often thought about what change I would make in 1970 if I could to fix
a lot of modern code. Eliminating the null-delimited buffer is definitely
number one, but explaining that there is no such thing as a "string" without
an encoding label would be number two. Anywhere I see a "string" in the input
or output specification for a function I just cringe.

~~~
xtho
This has little to nothing to do with the current situation in ruby.

> That said, "convert everything to your native Unicode format at the edges
> and reconvert it back out at the edges" is at least a tolerable answer.

It obviously isn't for the Ruby developers. If it were, they would have chosen
UTF-8 as the internal encoding, which they didn't, because they didn't
consider it a tolerable answer. Even though you can get Ruby 1.9 to work this
way, the approach can still cause headaches.
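For reference, a sketch of what "getting Ruby 1.9 to work this way" looks like, roughly equivalent to starting ruby with `-E UTF-8:UTF-8`:

```ruby
# Force a UTF-8-everywhere policy via the two global encoding defaults.
Encoding.default_external = "UTF-8"   # assumed encoding for IO without an explicit one
Encoding.default_internal = "UTF-8"   # IO transcodes into this on read

p Encoding.default_external   # => #<Encoding:UTF-8>
p Encoding.default_internal   # => #<Encoding:UTF-8>
```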

~~~
jerf
"This has little to nothing to do with the current situation in ruby."

I was addressing the complaint that the encoding in Ruby is hard now, and it
broke working code. Encoding is fundamentally hard, and if encoding used to be
easy it is almost certainly because your old code got it wrong, and your old
code _probably_ wasn't working. I emphasize the "probably" because it is
faintly possible that your old code really did work and now it really doesn't
work, in which case I would understand the frustration, but if I were giving
odds on the chance that the old code actually handled everything correctly I'd
open the bidding at somewhere around 5:1 for a superstar encoding expert
(working in a language with poor encoding labelling support), with the odds
getting worse the further from that you get. There are some things that are
just hard without language support even for experts.

------
sanxiyn
Re: characters not covered by Unicode (or the BMP). Yes, they exist, but they
are red herrings. Unregistered characters will always be with us. I fully
guarantee that. But for a vast range of applications, Unicode does work.

I am from Korea, and one of the treasures of this country is the Tripitaka
Koreana, a compilation of Buddhist texts carved in the 13th century. It is 52,382,960
characters long, Wikipedia tells me. There is a whole institute devoted to
this document. This institute started to encode it in machine readable form
from 1993 and completed the first draft in 2000. In the process they
discovered 23,385 new letterforms not registered anywhere. There are many such
encoding projects yet to be completed. So yeah, Unicode won't cover
everything. That is given. And that's okay.

------
mark_l_watson
Sure, upgrading to Ruby 1.9.x is a hassle (character encodings, changes to the
Array class, etc. do break some old code). That said, 1.9 gives a good
performance boost that Ruby needs, so man up and just do it.

It is also a "public good" issue: the sooner everyone up-converts to 1.9, the
easier it will be to develop with Ruby, because all required gems will work,
etc. I have whined quite a bit about up-convert hassles on rubyplanet.net, so
I do understand the author's pain and complaints, but we do all need to move
forward.

~~~
xtho
The sooner? Ruby 1.9 has been around for more than a year now. 1.8 is still
the standard in Ubuntu and just about everywhere else.

With respect to the performance boost: startup time didn't improve, and IIRC
the same is true for certain string operations. Both are important in the
field where Ruby originated and where I personally still find it most useful
-- scripting, or rather use as a Perl replacement.

~~~
alttab
We use ruby at Spiceworks and are internally switching to 1.9.1. We are doing
it for performance as well as internationalisation. While the encoding was an
issue upfront (and we have guys converting our app from 1.8.6 as their primary
focus), we do the edge-UTF8 approach, and we've updated a lot of the gems to
1.9 without waiting for others.

~~~
wallywalrus
Get back to work, Scottie.

------
crazydiamond
I am not sure anyone else has it right either, meaning Java and Python. I read
an article a month or two back detailing what others have done -- none are
proper/complete solutions. But from what I've seen in Ruby discussions, a lot
of rubyists are having problems with the new encoding system.

~~~
DrJokepu
In C#, everything is assumed to be UTF-8 unless you explicitly change the
encoding (the language-independent runtime is a bit more complicated). The
only exception is where it is likely that some input is not UTF-8 (such as a
byte array), in that case, you have to explicitly define the encoding to use.
Works pretty well, never really had encoding problems in C# and I had my share
of dealing with non-latin characters in my apps.

~~~
sid0
I'm pretty sure C# uses UTF-16, not UTF-8. Regardless, Unicode is Unicode.

~~~
DrJokepu
Actually you're right, UTF-16 it is. Unfortunately it is too late to edit (or
delete) my original comment.

------
brianmario
Some people may not like it, but this is exactly why I chose to force UTF-8 in
<http://github.com/brianmario/mysql2> for all the strings you get back (in
1.9), and for the connection itself. We've all dealt with improper use of
encodings between applications, their persistence layer and their
presentation. It's a nightmare unless you put your foot down and say "We're
making everything Unicode, nothing comes in or leaves unless it is". This
obviously doesn't work for _everyone_ but it's my experience that it will work
for 99% of all use cases.
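The driver-level trick this choice relies on can be sketched like so; the sample bytes are made up, but the distinction is real: bytes known to be UTF-8 only need relabelling with `force_encoding`, not transcoding with `encode`:

```ruby
# Bytes arriving from a UTF-8 connection come in untagged (ASCII-8BIT).
raw = "caf\xC3\xA9".b               # wire bytes for "café"
p raw.encoding                      # => #<Encoding:ASCII-8BIT>

str = raw.force_encoding("UTF-8")   # relabel in place; no bytes change

p str                    # => "café"
p str.valid_encoding?    # => true
```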

------
Confusion
The author doesn't seem to understand the difference between an encoding and a
character set. We already have a character set that denotes any possible
character in the universe: Unicode. We also have several encodings that allow
us to reference each of the characters in the set, the most well-known of
which is UTF-8. However, UTF-8 is optimized for the Western code points, which
is why alternate encodings exist. Moreover, there is all kinds of data in
legacy encodings that we want to work with. Encodings are hard, but you can't
go shopping without '?' showing up in your apps.
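The character set vs. encoding distinction is easy to see in Ruby: one character from the Unicode set, three different byte sequences depending on the encoding chosen:

```ruby
# é is U+00E9 in the Unicode character set; its bytes differ per encoding.
utf8   = "é"
latin1 = utf8.encode("ISO-8859-1")
utf16  = utf8.encode("UTF-16BE")

p utf8.bytes     # => [195, 169]
p latin1.bytes   # => [233]
p utf16.bytes    # => [0, 233]
```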

------
pkulak
The biggest problem is that everything, after a clean install, seems to
default to latin1 or ASCII, so the first thing you need to do is run around to
every single piece of software your app touches (the database, web forms, the
OS, etc.) and make it send and receive Unicode. And God help you
if you forget one.
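Even on the Ruby side alone there are several places an encoding default hides, which gives a sense of how many knobs the full stack has:

```ruby
p Encoding.default_external   # default for IO reads and writes
p Encoding.default_internal   # transcode-on-read target (often nil)
p __ENCODING__                # this source file's encoding
p Encoding.locale_charmap     # what the OS locale claims
```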

~~~
moe
_And God help you if you forget one._

God speaks latin1, too.

------
Tichy
I never understood how to do Unicode in Ruby 1.8, though. Or, to put it more
succinctly, I could never be sure. Especially not with Rails; somehow I could
not even find anything about it via Google. It seemed to somehow work, but I
want to know what is going on -- is my stuff UTF-8 or not?

~~~
carbon8
As I understand it, Ruby 1.8 strings are simply treated as non-encoded
sequences of bytes. It has some support for UTF-8, so as long as you can use
UTF-8 and ISO-8859-1 (latin1, since the mapping is the same), just think about
it as passing around byte sequences and don't need to do a lot of string
manipulation, it's not horrible to work with. However, you do need to jump
through hoops to use a number of the standard string methods like #length with
multibyte characters (often, this means running the string through
#scan(/./mu) before working with it:
[http://blog.grayproductions.net/articles/bytes_and_character...](http://blog.grayproductions.net/articles/bytes_and_characters_in_ruby_18)).
Rails also has a UTF-8 handler to help with manipulation of UTF-8 strings with
multibyte characters, and I think it's on by default
([http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/H...](http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Handlers/UTF8Handler.html))
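The `#scan(/./mu)` workaround still runs on modern Ruby, so the byte-count vs. character-count difference it papered over can be shown directly (the sample string is made up):

```ruby
s = "héllo"                 # é is two bytes in UTF-8
p s.bytes.length            # => 6   what 1.8's String#length reported
p s.scan(/./mu).length      # => 5   the 1.8-era character-count workaround
p s.length                  # => 5   1.9+ counts characters natively
```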

~~~
jamiecobbett
In a sentence: Ruby 1.8 strings are just a sequence of bytes, so methods like
length count the number of bytes, while Ruby 1.9 strings are a sequence of
characters, so length counts characters. This makes a difference when you have
multibyte characters and want an accurate length, a correct split, reverse, etc.

I found this series of articles well worth the time to read and understand:
<http://blog.grayproductions.net/articles/understanding_m17n>
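The character-vs-byte semantics described above, shown with 1.9-era string methods (the sample string is arbitrary):

```ruby
s = "日本語"                # three characters, three bytes each in UTF-8
p s.length     # => 3   characters (1.9 semantics)
p s.bytesize   # => 9   bytes (what 1.8's length counted)
p s.reverse    # => "語本日"   reversed by character, not by byte
```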

------
nudded
To the author of this post:
<http://www.joelonsoftware.com/articles/Unicode.html>

~~~
dan_sim
That's the point of the post: we should not have to know anything about
encoding. It should just work.

~~~
crazydiamond
I think there are trickier issues that a programming language has to deal
with. For example, concatenating two strings that use different encodings:
conversion is required, and specifying the encoding is required.

And by the way, there is a loss here, since not all character sets are fully
covered even in UTF-16 (IIRC). I am trying to recall -- maybe Matz gave a
detailed reply somewhere on the net.
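Ruby 1.9+ makes the concatenation problem explicit rather than guessing; a small sketch (sample strings made up):

```ruby
a = "café"                          # UTF-8
b = "café".encode("ISO-8859-1")     # same text, different bytes

begin
  a + b                             # mixed encodings with non-ASCII data
rescue Encoding::CompatibilityError => e
  puts e.class                      # => Encoding::CompatibilityError
end

p a + b.encode("UTF-8")             # convert first, then concatenate
```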

~~~
sid0
> And btw, here there is a loss since all character sets are not fully covered
> even in UTF-16 (iirc)

If they aren't covered in UTF-16, they wouldn't be covered in UTF-8 or UCS-4
either. All modern Unicode encodings (i.e. _not UCS-2_ ) can encode exactly
the same data.

I'd be curious to know which character sets Unicode doesn't cover yet a
different encoding system does.
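The losslessness claim is easy to check in Ruby; the sample string below is arbitrary, and includes a code point outside the BMP (which UTF-16 encodes with a surrogate pair):

```ruby
s = "한" + "\u{10384}"    # a BMP character plus one outside the BMP

# Round trips through other Unicode encodings lose nothing:
p s.encode("UTF-16BE").encode("UTF-8") == s   # => true
p s.encode("UTF-32LE").encode("UTF-8") == s   # => true
```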

