Some Thoughts on Unicode in Perl 6

Mithaldu · on Nov 30, 2013

Counting graphemes by default is really something all computer languages will need to adopt at some point in the future if we collectively want to get out of the morass of encoding errors. Good to see at least one language is making tiny inroads towards that.
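To make the distinction concrete, here is a minimal Python 3 sketch of how byte, code point, and grapheme counts can disagree for the same visible character. The helper `approx_grapheme_count` is a hypothetical name, and skipping combining marks is only a rough approximation of real grapheme segmentation (it does not handle emoji sequences or Hangul jamo, for example).

```python
import unicodedata

def approx_grapheme_count(text: str) -> int:
    """Rough count of user-perceived characters (assumption: a grapheme is
    one base code point plus any combining marks that follow it)."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

precomposed = "\u00f6"   # "o with umlaut" as a single code point
decomposed = "o\u0308"   # "o" followed by COMBINING DIAERESIS

# Bytes in UTF-8: 2 vs 3
print(len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))
# Code points: 1 vs 2
print(len(precomposed), len(decomposed))
# Approximate graphemes: 1 vs 1
print(approx_grapheme_count(precomposed), approx_grapheme_count(decomposed))
```

Only the last count matches what a reader sees on screen, which is the argument for making it the default.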

octo_t · on Nov 30, 2013

Even in perl5, the unicode support is incredibly good:

http://stackoverflow.com/questions/6162484/why-does-modern-p...

pyre · on Nov 30, 2013

One of tchrist's Unicode talks from OSCON '11:

http://dheeb.files.wordpress.com/2011/07/gbu.pdf

(The official URL is: http://training.perl.com/OSCON2011/index.html but it's currently timing out for me)

berntb · on Nov 30, 2013

Thanks, helpful reference!

It does feel like getting information on the construction method of the whips which will be used on your back sometime in the future. :-) :-(

buster · on Nov 30, 2013

"good"? I always found that it was extremely annoying and hell to get right, what the fuck is with all the different options? To this day, the last tool i wrote that i wanted to be unicode-proof still spits out the occasional warning, but i just don't care anymore about the warnings that pop up. Even if perl5 can do unicode, it's a nightmare to use, imo.

And the first comment (which i did go through for my ventures into perl unicode) is nothing but proof of how broken it is. So maybe you are just being sarcastic and i didn't get it.. ;)

Mithaldu · on Nov 30, 2013

You don't get it.

First of all you don't NEED most of the stuff shown there. The boilerplate at the bottom is there to cover EVERY POSSIBLE THING anyone would possibly ever want to do with unicode in Perl and because tchrist thinks it's funny to do such things. If all you want is to read a unicode file, change its contents, then write it back to hdd without breaking encoding, you don't need more than:

    use utf8;
    use IO::All -utf8;

Secondly, all other languages do it worse than Perl 5.

berdario · on Nov 30, 2013

> Secondly, all other languages do it worse than Perl 5.

[citation needed]

Jokes aside :) if you look at the 2011 OSCon presentation pdf, you'll see it's vastly outdated (and imho the default internal encoding table is just imprecise)

OTOH I don't use unicode regexps very often, so my POV is biased (I'm sure that on this front perl5 can actually be better than virtually everything else)

buster · on Nov 30, 2013

Well, that's nothing more than the most basic case of string usage. Of course that works.

For a rather complex multithreaded client/server program that reads from and writes to different sources (http, webdav, ldap, file, terminal) with a lot of dependencies it really was a nightmare, trust me. Maybe i didn't get some obvious global "do that unicode stuff right" switch and i am just too dumb. But even then, i've never had such problems in _any_ other language.

Interestingly i did need to write a very similar tool one year later in python and i had far fewer problems. And that was python 2.x, which is always said to have bad unicode support. Maybe it's just because in python (afaik) you only have to take care of some str() or unicode() semantics and not a gazillion use statements, open() parameters and whatnot.

Maybe Perl5 has perfect unicode support but how to use and achieve that is a freaking nightmare in a non-trivial program.

Mithaldu · on Nov 30, 2013

Quite honestly, there's not even any need for the "use" statements. When dealing with unicode you only need to do one thing:

For any input/output, find out whether you read/write bytes from/to it or characters, and depending on what you find, decode (on input) or encode (on output), or don't.

I'm very sure that the only reason you had trouble with your first tool was a lack of knowledge and thus a lack of pervasive diligence, leading to you leaving some inputs/outputs unhandled or, even worse, handled twice.

Then you ended up doing that stuff more correctly on your second try, since python asks you to do the same things.
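The "bytes or characters" rule described above can be sketched in Python 3 roughly as follows. This is only an illustration under assumptions: the helper names `read_text` and `write_text` are made up, UTF-8 is assumed as the encoding, and in-memory streams stand in for real files or sockets.

```python
import io

def read_text(stream, encoding="utf-8"):
    """Return characters from a stream, decoding only if it yields bytes."""
    data = stream.read()
    if isinstance(data, bytes):
        return data.decode(encoding)   # byte stream: decode exactly once
    return data                        # character stream: already decoded

def write_text(stream, text, encoding="utf-8"):
    """Write characters to a stream, encoding only if it expects bytes."""
    if isinstance(stream, io.TextIOBase):
        stream.write(text)                   # character stream: pass through
    else:
        stream.write(text.encode(encoding))  # byte stream: encode exactly once

original = "Grüße".encode("utf-8")
text = read_text(io.BytesIO(original))

out = io.BytesIO()
write_text(out, text)
round_tripped = out.getvalue()

# What "handled twice" looks like: bytes misread as Latin-1, then encoded again.
double_encoded = original.decode("latin-1").encode("utf-8")

print(round_tripped == original)    # True
print(double_encoded == original)   # False
```

The double-encoded result is the classic source of garbled output: each boundary has to be handled once, not zero times and not twice.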

All the use statements in Perl do is spare you from doing that by hand, because they make certain operations work in Unicode mode by default.

In short: Please try to actively separate your own past shortcomings and growth as a developer from judgements on the capability of a language.

buster · on Dec 1, 2013

Mh, no. That's not it, since in the beginning the code didn't use those statements. I was taking ownership of the code and then had weird warnings popping up all over the place. The fix for them was always one of those things from the stackoverflow list.

So, no, i'm not convinced perl has good unicode support in terms of actually supporting the programmer. Mind you, that was 2 or 3 years ago, so maybe the situation is different now. To me, it was just a stupid task of creating new tests, finding a warning and then trying to figure out what the fix is. Sometimes the fix broke other parts or libraries and then you go to fix #2 (because of course there are a gazillion ways in Perl to do a single task). To me it was really not a very good experience. And really, just look at that stackoverflow article. That's consistent and good unicode support? No. Maybe the current perl is different, though (i think the project ran 5.12 or 5.14).

Mithaldu · on Dec 1, 2013

Man, i didn't want to trot this out at the start, but well, it turned out my initial hunch was entirely correct. You took software that was made by flailing at it like a monkey, then flailed at it some more. No wonder it turned out as an unpleasant experience.

Also, as to your question about the SO post. You claim:

"The fix for them was always one of those things from the stackoverflow list."

I find myself thoroughly baffled by that, since over half that post busies itself with promoting warnings into fatal errors that end the program with stacktraces.

Honestly, i'd love to see your warnings-spitting tool, just so i could straighten it out some.

pyre · on Dec 1, 2013

I think your issue is that, in general, Unicode is Hard(tm). For example:

* How do you compare strings in Ruby so that an "o with umlaut" as a single character compares equal to an "o" followed by a combining "umlaut" character? Technically the two are canonically equivalent, but if you're doing comparisons by single characters (or bytes!), then you'll run into issues.

* How do you use a regular expression in Python to match against o and "o with umlaut" as equivalent?
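For illustration, one possible answer to both questions in Python 3, using only the standard library. This is a sketch, not the only approach: the names `canonically_equal` and `find_o_like` are hypothetical, and the regex trick relies on decomposing the input first, so it only recognises the umlaut when it directly follows the "o".

```python
import re
import unicodedata

def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to the same form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

# After NFD, "o with umlaut" is always "o" + U+0308 COMBINING DIAERESIS,
# so a single pattern covers both the plain and the umlauted letter.
O_WITH_OPTIONAL_UMLAUT = re.compile("o\u0308?")

def find_o_like(text: str) -> list:
    """Return every 'o' or 'o with umlaut' found in the decomposed text."""
    decomposed = unicodedata.normalize("NFD", text)
    return O_WITH_OPTIONAL_UMLAUT.findall(decomposed)

single = "\u00f6"     # precomposed
combined = "o\u0308"  # base letter plus combining mark

print(single == combined)                   # False: different code points
print(canonically_equal(single, combined))  # True
print(len(find_o_like("sch\u00f6n so")))    # 2
```

The same normalize-then-compare idea applies in Ruby and Perl; the point is that none of it happens unless the programmer asks for it.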

octo_t · on Nov 30, 2013

I would add the caveat of 'does unicode well without breaking backwards compatibility'

greenlakejake · on Nov 30, 2013

Perl 6 is the Unicorn of computer languages - a mythical beast.

Yes I know I'll be downvoted to zero karma.

hibbelig · on Nov 30, 2013

Duke Nukem Forever! :-)

greenlakejake · on Nov 30, 2013

I used to joke that Duke Nukem Forever hadn't shipped because they were using Perl 6. But Duke shipped first. :)