

Perl Unicode Cookbook - draegtun
http://training.perl.com/scripts/perlunicook.html

======
pavelkaroukin
I love to develop with Perl, but Unicode-related stuff was always frustrating
for me. Probably because I deal with it so rarely, and every time I have to look
at the documentation to find out how to read a Unicode file, run regexps on
Unicode strings, etc.

I am wondering how other major languages deal with Unicode. Is there a language
that works with Unicode files/text transparently, without me having to change
anything from the defaults?

~~~
draegtun
Perl comes out best according to Tom Christiansen (author of this _Perl
Unicode Cookbook_ post).

See his _Unicode Support Shootout - The Good, the Bad, & the (mostly) Ugly_
presentation at OSCON 2011 where he gave a comparison of Unicode handling in
JavaScript, PHP, Go, Python, Ruby, Java, and Perl.

ref: <http://training.perl.com/OSCON2011/index.html>

~~~
joeyh
Perl is in the lower tier IMHO, below Python and Haskell. While it has good
core Unicode support, every single library makes different choices, so
your code constantly needs to encode/decode. Javascript seems to avoid
encoding issues entirely, at least in the browser, though Tom has some nice
points about it not fully supporting Unicode.
<http://kitenet.net/~joey/blog/entry/unicode_ate_my_homework/>

~~~
ajross
Can you be more specific? Perl's utf8 support is IMHO the cleanest and most
transparent. Generally libraries don't have to care at all about the encoding
of the stuff they deal with. I'd be curious what trouble you've had.

(It's true that early versions of "use utf8" were horribly broken, munging
strings silently at I/O time, but that was 10 years ago or more)

~~~
joeyh
Nearly every XS library takes its input as UTF-8 encoded bytes, so you have
to manually juggle the encoding when using it. There is no documentation
stating what encoding any library expects. I just checked a source tree and
found 41 calls to encode_utf8 and 42 calls to decode_utf8 to deal with this
across dozens of libraries.
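
The kind of juggling meant here can be sketched in a few lines. This is in Python rather than Perl, purely to illustrate the pattern; `byte_upper` is a hypothetical stand-in for a byte-oriented native binding, not a real library.

```python
def byte_upper(data: bytes) -> bytes:
    """Hypothetical stand-in for a native binding that only accepts bytes."""
    if not isinstance(data, bytes):
        raise TypeError("expected UTF-8 encoded bytes")
    # bytes.upper() only touches ASCII letters, as a naive C routine might.
    return data.upper()


text = "na\u00efve caf\u00e9"  # a character string, not bytes

# Manual juggling at the call site: encode on the way in, decode on the way out.
raw_in = text.encode("utf-8")
raw_out = byte_upper(raw_in)
result = raw_out.decode("utf-8")
```

Every call site that touches such a library ends up repeating the encode/decode pair, which is roughly where counts like the ones above come from.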

~~~
ajross
Um... of course they do. XS is a translation library and the domain you're
translating to (C) has no (well, no sane) native encoding handling. So you
pass byte buffers. That's not a ding on _perl_ though, it's an inherent
impedance mismatch you get when you cross interface domains. JNI libraries tend to have
most of the same problems, either forcing the user to do the UTF16/UTF8
conversion or doing it in the library.

I guess I'm surprised that Python and Haskell don't have equivalent messes
when dealing with native libraries. Are you sure they don't?

~~~
joeyh
Perl XS libraries often provide a full object oriented wrapper around the
native C calls. They always have, at a minimum, a thin perl wrapper function.
There's plenty of scope for the encoding to be done automatically, but the
current state is that it's not even done manually by the libraries.
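
As a rough sketch of what "done automatically" could look like in a thin wrapper (in Python for illustration only, not Perl/XS): `utf8_boundary` and `reverse_words` are hypothetical names, and the assumption is that the wrapped routine takes and returns UTF-8 bytes.

```python
import functools


def utf8_boundary(func):
    """Wrap a bytes-in/bytes-out function so callers pass and receive str."""
    @functools.wraps(func)
    def wrapper(text: str) -> str:
        # Encode once on the way in, decode once on the way out.
        return func(text.encode("utf-8")).decode("utf-8")
    return wrapper


@utf8_boundary
def reverse_words(data: bytes) -> bytes:
    """Hypothetical byte-oriented routine standing in for a native call."""
    # Splitting on an ASCII space is safe for UTF-8 encoded data.
    return b" ".join(reversed(data.split(b" ")))
```

With the wrapper in place, callers only ever see character strings, and the encoding decision lives in one spot inside the library.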

Haskell's equivalent FFI interfaces do encoding conversions, so that if they
return a String, it's a sequence of Unicode code points, as the Haskell
standard requires. Those that return raw data will use an appropriate
ByteString type. Haskell didn't use to do encoding conversions for FilePaths
in the FFI, but this has been fixed in version 7.4, and is even done in a way
that doesn't require Unix filenames to be UTF-8 encoded -- other encodings will
pass through unchanged.

~~~
chromatic
Encoding conversions at the FFI layer would be very nice. How do the Haskell
interfaces know which encodings to expect?

------
obtu
There's something wrong with the font size of code samples.

~~~
pavelkaroukin
Code snippets tend to be long, so they have font-size: 65% for <code>.

I've found that increasing the font size is required anyway, since my browser's
default font size is a bit small for my screen resolution.

Increasing the font size is quite easy in most browsers. Hold the Ctrl key and
use the mouse wheel to adjust the size. This is a per-site setting in most
browsers, I believe.

~~~
obtu
I've reenabled Firefox's minimum font size; it's a bit buried in the
preferences (fonts->advanced) but it's more effective to fix sites where the
sizes lack consistency. For most sites NoSquint works well.

~~~
pavelkaroukin
This particular site uses the browser's defaults. Calling this a lack of
consistency is wrong :)

------
arnsholt
I'd love to have the POD of this locally, but can't seem to find it anywhere.
Anyone know where it lives?

------
scottw
Isn't that just a little ironic that the Unicode bullets have been converted
to the prescription symbol?

~~~
gregheo
I'm pretty sure that's on purpose (Unicode prescription 1, prescription 2,
etc.). Or did my <sarcasm> tag detector mess up again?

