

How to properly use UTF-8 in Perl - ojosilva
http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/#6163129

======
jrockway
This is way too much to remember.

The key to handling Unicode correctly is to think about the data that comes
into your program and the data that leaves your program. If it's text, it has
some sort of encoding, and you need to decode those encoded characters before
you can use them as text in your program. Similarly, you can't just leak
Perl's internal representation of characters to the world: you have to
explicitly encode the characters to octets.
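
A minimal sketch of that boundary discipline, written in Python rather than Perl purely for illustration. The function name `capitalize_first` and the default encoding are assumptions made for this example, not anything from a real library:

```python
def capitalize_first(raw: bytes, encoding: str = "utf-8") -> bytes:
    """Decode incoming octets, work on characters, encode on the way out."""
    text = raw.decode(encoding)          # octets -> characters (input boundary)
    text = text[:1].upper() + text[1:]   # a text operation that needs real characters
    return text.encode(encoding)         # characters -> octets (output boundary)


if __name__ == "__main__":
    # "\u00e9lan" is "élan"; the result is the UTF-8 octets of "Élan".
    print(capitalize_first("\u00e9lan".encode("utf-8")))
```

The same function works for any encoding the caller names, because the text operation in the middle never sees octets.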

The problem that people run into is not knowing that you have to do this (in
every programming language), and then deciding which encoding to use. In some
cases, it's easy: HTTP includes the encoding in the headers, so all you have
to do is get the request, decode according to the headers, and you're set.
Similarly, when you send an HTTP response to someone, just set the headers,
encode the characters, and everything ends up 100% correct.
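
As a rough Python sketch of the HTTP case: the helper names below are hypothetical, and falling back to UTF-8 when the header carries no charset is an assumption of this example, not a rule taken from the HTTP specification.

```python
from email.message import Message


def charset_from_content_type(content_type: str, default: str = "utf-8") -> str:
    """Pull the charset parameter out of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_body(content_type: str, body: bytes) -> str:
    """Decode a request body according to its declared charset."""
    return body.decode(charset_from_content_type(content_type))


def encode_response(text: str, charset: str = "utf-8"):
    """Return a (Content-Type header value, encoded body) pair that agree."""
    return f"text/plain; charset={charset}", text.encode(charset)


if __name__ == "__main__":
    header, body = encode_response("日本語")
    print(header, body, decode_body(header, body))
```

The point of returning the header and the body together is that the declared charset and the actual encoding cannot drift apart.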

The problem comes when poor design causes you to assume the character
encoding. What encoding are file names on a random removable disk? What
encoding is this text file in? What encoding are those database rows? If you
don't know, you simply can't process that data correctly as text. But most
people assume "something will magically decide for me and it will all work
out". Nope, it won't. Don't rely on magic: be explicit.

Things work most of the time because you treat the data as binary: opaque,
meaningless octets that don't have semantics like "make the first letter
capital". This will work if your world is entirely UTF-8-encoded: UTF-8
filenames, UTF-8 files, UTF-8 source code, UTF-8 database results, etc.

The reason people run into trouble with Perl and not with other languages is
that Perl assumes that when you treat binary data as text, you really have
Latin-1 text. It then says, "hey, this is text", upgrades it to Unicode, and
then transforms it as such. When your text was Latin-1 (the backcompat case),
it all works. When it's UTF-8, though, then you get double encoding: the
dreaded "æ¥æ¬èª" instead of "日本語". But if you just tell Perl, via
Encode::decode_utf8, it will know that you actually have UTF-8-encoded text
and not Latin-1 text, and everything will work!
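
The double-encoding effect can be reproduced in a few lines. This is a Python sketch of the mechanism, not of Perl's internals; the Latin-1 misreading is done by hand here to imitate the assumption described above.

```python
original = "日本語"
octets = original.encode("utf-8")          # 9 octets arriving from the outside world

# The mistake: read the UTF-8 octets as if they were Latin-1 text...
misread = octets.decode("latin-1")         # 9 bogus characters instead of 3
# ...and then encode that "text" as UTF-8 on the way out.
double_encoded = misread.encode("utf-8")   # 18 octets, none of them Japanese

# What a viewer shows once the invisible control characters are dropped.
visible = "".join(ch for ch in misread if ch.isprintable())

# The fix: say what the encoding really is when decoding.
text = octets.decode("utf-8")

if __name__ == "__main__":
    print(visible)
    print(text)
```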

Anyway, the solution is to always specify the encoding via
Encode::decode_utf8($str) or Encode::decode('my-encoding', $str) when reading
data, and to use Encode::encode_utf8 or Encode::encode when outputting data (even
to the terminal!). The hardest part is making sure your libraries do this
(DBD::SQLite does, Catalyst does, LWP does), and making sure that your
libraries have enough information to do it right. HTTP is easy, there's a
header. But that blob in your database is not easy, and you may have to handle
it yourself... because the information about the encoding exists only in your
brain, not anywhere the computer can find it.

(Oh, and this only gets you to the "I'm not trashing any information" stage.
If you want your Japanese text to sort あ い う え お instead of in codepoint
order, that's going to require a module. Even simpler cases will require you
to learn about collation and normalization.)
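
To see what "codepoint order" means in practice, here is a small Python illustration (the word lists are made up for the example). It only shows the problem; proper collation needs a locale-aware library, which this sketch deliberately does not attempt.

```python
# Plain sorting compares code points, not dictionary order.
words = ["zebra", "\u00c4pfel", "apple"]   # "\u00c4pfel" is "Äpfel"
by_codepoint = sorted(words)
# U+00C4 (Ä) is greater than U+007A (z), so "Äpfel" lands after "zebra".

kana = ["い", "ア", "あ"]
kana_by_codepoint = sorted(kana)
# Katakana ア (U+30A2) sorts after every hiragana letter, whereas a Japanese
# dictionary order would normally keep あ and ア together.

if __name__ == "__main__":
    print(by_codepoint)
    print(kana_by_codepoint)
```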

Edit: distilled the key points I raise here into an SO answer:
<http://stackoverflow.com/questions/6162484/why-does-modern-perl-avoid-utf-8-by-default/6192088#6192088>

------
js2
See "The Absolute Minimum Every Software Developer Absolutely, Positively Must
Know About Unicode and Character Sets (No Excuses!)"

<http://www.joelonsoftware.com/articles/Unicode.html>

------
unwind
I think I'll hold out for the ... simple way. There should be more than one
way to do it, right?

~~~
hercynium
Indeed... Simple, and probably wrong.

I won't pretend that Perl's Unicode support sucks; working with Unicode in
Perl is far more complicated than it ought to be and rife with bugs and
corner-cases. However, working with Unicode _properly_ in just about _any_
language requires a whole lot of careful thought and consideration.

Maybe Java (and by extension JVM languages) is the best you can get right now,
I dunno...

~~~
hercynium
whoops, I meant that it _does_ suck, and I won't pretend otherwise.

------
dexen
After reading this lengthy how-to, I am tempted to say that PHP _may_ have
the upper hand when it comes to _reliably_ dealing with UTF-encoded data from
different sources (including source code) and sinks. At least as long as I
default to UTF-8 and convert only when explicitly requested by the user or a
remote system to do so.

PHP has no separate Unicode type, the built-in string type works pretty well
for UTF-8. I'd liken its use to the concept of duck-typing (as seen, for
example, in Python): if it walks like a duck^W^W UTF-8, swims like a duck^W^W
UTF-8, quacks like a duck^W^W UTF-8, it can be handled (consumed and produced)
like a duck^W^W UTF-8.

Note that in Python, Unicode text is a separate data type, so the comparison
above concerns duck-typing only.

~~~
pornel
It's not pretty in PHP either. Many of the Perl issues have their PHP
counterparts.

• You need to be aware that most built-in string functions won't work and use
mb_ equivalents where available (and not all are available).

• You need to set internal encoding to UTF-8 for some extensions in php.ini.

• You need to normalize all input yourself (thankfully PHP 5.3 has the
Normalizer class; in 5.2 it was a nightmare).

• The PCRE library doesn't support all Unicode ranges (e.g. {InFoo} is
unsupported). All gotchas of regexes apply to PHP/PCRE as well.

The list goes on…

"Binary-safe" strings and UTF-8 cleverness will let you roundtrip characters
safely in most cases, but full, correct Unicode support is really hard.

~~~
ars
> You need to be aware that most built-in string functions won't work

Actually the non-mb string functions will work perfectly fine for almost
everything. For example splitting on space or comma, or joining strings will
work just fine with UTF-8. The only things that don't work are those that
split by character position. Even search works fine.
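
The distinction holds for any byte-oriented string API, because ASCII bytes never occur inside a multi-byte UTF-8 sequence. Here is a Python sketch using `bytes` to stand in for a byte-string type (the sample text is invented for the example):

```python
# "na\u00efve" is "naïve" and "caf\u00e9" is "café".
text = "na\u00efve, caf\u00e9, 日本語"
data = text.encode("utf-8")

# Delimiter-based operations are safe on the raw octets.
parts = data.split(b", ")
joined = b" | ".join(parts)

# Searching works too, but the result is a byte offset, not a character offset.
byte_offset = data.find("caf\u00e9".encode("utf-8"))
char_offset = text.find("caf\u00e9")

# Position-based slicing is where it breaks: this cuts the two-byte "ï" in half.
broken = data[:3]

if __name__ == "__main__":
    print([p.decode("utf-8") for p in parts])
    print(byte_offset, char_offset)
    print(broken)
```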

> You need to set internal encoding to UTF-8 for some extensions in php.ini.
    
    
      ini_set('default_charset', 'UTF-8');
      mb_internal_encoding("UTF-8");
    

> You need to normalize all input yourself

Only if you need to compare against it (for example to see if the username was
taken). Most of the time you just store the input as is.
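
A sketch of that username check in Python, assuming NFC plus case folding is an acceptable comparison key; `username_key` and `is_taken` are hypothetical helpers for this example, and a real system may want stricter rules. The input is still stored as received, and only the comparison uses the normalized form.

```python
import unicodedata


def username_key(name: str) -> str:
    """Comparison key: canonical composition, then case folding."""
    return unicodedata.normalize("NFC", name).casefold()


def is_taken(candidate: str, existing: list) -> bool:
    """True if the candidate collides with any stored name after normalization."""
    taken_keys = {username_key(name) for name in existing}
    return username_key(candidate) in taken_keys


composed = "Zo\u00eb"       # "ë" as a single code point
decomposed = "Zoe\u0308"    # "e" followed by a combining diaeresis

if __name__ == "__main__":
    print(composed == decomposed)            # raw strings differ
    print(is_taken(decomposed, [composed]))  # normalized keys match
```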

For PCRE make sure to add the /u option.

~~~
dexen
I'd just stick to the mb_ereg_...() functions. They follow the standard
setlocale(LC_...). They support the standard POSIX regex syntax. They support
Unicode whenever the platform does (that depends only on the standard C
library), and I've seen PHP's PCRE distributed _without_ Unicode support. Last
but not least, PCRE suffers from the exponential backtracking described in
<http://swtch.com/~rsc/regexp/regexp1.html>

    
    
      setlocale(LC_ALL, 'pl_PL.utf-8'); // if not set explicitly, defaults to the system's $LC_ALL, which fits most of the time
      mb_internal_encoding('UTF-8');
    
      var_dump(mb_ereg_match('[[:alpha:]]', 'Ł')); // true
      var_dump(mb_ereg_match('[[:alpha:]]', '♙')); // false: that's a white chess pawn character, U+2659
      var_dump(mb_ereg_match('[[:graph:]]', '♙')); // true
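
For comparison, the same three checks can be approximated in Python straight from the Unicode character database, with no locale involved. The helpers `is_alpha` and `is_graph` are made up for this sketch, and `is_graph` is only a rough stand-in for POSIX `[[:graph:]]`:

```python
import unicodedata


def is_alpha(ch: str) -> bool:
    """Letter test based on the Unicode character database."""
    return ch.isalpha()


def is_graph(ch: str) -> bool:
    """Rough analogue of [[:graph:]]: printable and not whitespace."""
    return ch.isprintable() and not ch.isspace()


if __name__ == "__main__":
    for ch in ("Ł", "♙"):
        # Print the character, its general category, and both test results.
        print(ch, unicodedata.category(ch), is_alpha(ch), is_graph(ch))
```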

~~~
ars
I don't like setlocale because usually I'm writing a general application,
usable for any language.

~~~
dexen
I like setlocale() as a central knob that other functions obey. Works a-OK for
me on several different websites. Normally my website's code starts with
platform defaults, reads configuration early on, and, if need be, issues
setlocale(). Should an error be triggered before the config is applied, it
would be handled with default locales; I believe that's the only noticeable
downside. Hopefully a rare one ;-)

setlocale() can be switched at run-time, even multiple times if you want. You
could imagine running most of the code with the user's preferred locale to
output UI or report any errors, switching to another locale to format and send
a message to a foreign customer, and then back to normal settings to continue
with outputting UI, all in one pass of PHP.

