

Resetting PHP6 (or: Unicode claims another victim) - corbet
http://lwn.net/SubscriberLink/379909/26c35a974b1bbd65/

======
pmjordan
Wow. Considering PHP is used primarily for generating HTML, UTF-8 is the de
facto standard encoding for Unicode on the web, and PHP files are usually
ASCII or UTF-8, going for UTF-16 seems like a phenomenally bad idea. I can see
how UTF-16 looked promising back when it was still UCS-2 and there were no
surrogate pairs. These days though, UTF-8 and maybe UTF-32 seem to be the
realistic choices when working from scratch; UTF-32's advantage in some areas
is probably too weak to make it a real contender unless your strings are
literally linked lists, not codepoint arrays. (i.e. you don't care that it
uses 2-4 times as much memory or storage)
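
The size trade-off can be measured directly. A rough Python sketch, assuming three illustrative sample strings (the samples and the `encoded_sizes` helper are made up for this example); the ratio between UTF-32 and UTF-8 depends heavily on the script:

```python
# Illustrative samples: ASCII, Cyrillic, and CJK text.
samples = {"ascii": "hello", "cyrillic": "привет", "cjk": "日本語"}

def encoded_sizes(text: str) -> dict:
    """Return the encoded byte length of text in each encoding (no BOM)."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: len(text.encode(name)) for name in encodings}

for label, text in samples.items():
    print(label, encoded_sizes(text))
```

For pure ASCII, UTF-32 takes four times the bytes of UTF-8; for the Cyrillic sample it takes twice as many, and for the CJK sample the gap is smaller still.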

~~~
Carlfish
One advantage of UTF-16 is that unlike UTF-8, very few characters you
encounter in Real Life require surrogate pairs. So for the cost of a one-bit
flag per string you can assume two bytes per character in your string
operations for the overwhelmingly general case.

~~~
jerf
"So for the cost of a one-bit flag per string you can assume two bytes per
character in your string operations for the overwhelmingly general case."

Thank you for that clear and concise explanation of the dangers of using
UTF-16.

Yes, I know that wasn't your intention, but it was the end result. One of the
most dangerous library failures you can have is a function that works 99.99%
of the time. Or in this case, 100% of the time on the input the English-
speaking developer provides but distinctly less than 100% in the field.

In this specific case, you can't actually optimize anything because all your
optimizations are _bugs_. You can't just divide by two for character count;
that's not an optimization, it's a bug. You can't just multiply by two for a
substring operation, because you might chop a character in half; that's a bug.
And so on. You'd need a separate type that indicates you've scanned the string
to verify it never has split chars and now you might as well be on UCS-2, and
that has its own dangers w.r.t. working 99.99% of the time.

Much better to use UTF-8, where the dangers are much more apparent: all you
have to do is leave the base ASCII case and you're testing UTF-8. Even I, an
English-speaking developer, manage to test that case (once I know it exists,
anyhow). There are still ways you can screw up, but you're off to a much
better start.
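
A minimal Python sketch of the two shortcuts described above, assuming a string with one character outside the Basic Multilingual Plane. The sample text and the `is_well_formed` helper are illustrative, not taken from any real library:

```python
# 'a', MUSICAL SYMBOL G CLEF (U+1D11E, needs a surrogate pair), 'b'
text = "a\U0001D11Eb"

utf16 = text.encode("utf-16-le")  # little-endian, no byte order mark

code_points = len(text)        # Python 3 strings count code points
naive_count = len(utf16) // 2  # the "divide by two" shortcut

# Naive "first two characters": multiply by two, then slice the bytes.
naive_prefix = utf16[: 2 * 2]

def is_well_formed(data: bytes) -> bool:
    """Return True if data decodes cleanly as strict UTF-16-LE."""
    try:
        data.decode("utf-16-le")
        return True
    except UnicodeDecodeError:
        return False

print(code_points, naive_count)      # the two counts disagree
print(is_well_formed(naive_prefix))  # the slice ends on a lone high surrogate
```

With an all-BMP string such as "abc" both counts agree and the slice is fine, which is exactly why this kind of bug tends to survive testing.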

~~~
nradov
Right. From what I have seen, the majority of Java code out there is broken in
_exactly that way_.

------
pbiggar
I think the author has missed the larger problem. The PHP development
community is completely dysfunctional. I don't think that a project of the
magnitude of PHP 6 is possible without fixing that fundamental problem.

Why is it dysfunctional?

- every discussion leads to bikeshedding (and almost none of the bikeshedders
actually commit code to the Zend engine)

- there are 'rules', but they don't apply to most people (i.e. the 5.4 thing
in the article)

- no firm hand to guide them (Rasmus has deliberately not provided this)

- the mailing list has a complete lack of civility

- highest concentration of poisonous people to non-poisonous that I have ever
seen

- votes for everything

- patches are not discussed, either pre or post commit, so the code is bad,
and people won't work on it.

I was so glad to be the hell out of there.

~~~
jrockway
Dude, it's PHP.

~~~
pbiggar
This is the root cause of the problem. Since it's PHP, you can't get really
good developers to work on the core. The people who love PHP and the people
who have both the desire and skill to do the work described are separate
groups.

~~~
jacquesm
Considering that it is as widely adopted as it is though, you'd have to agree
that they succeeded in spite of all these hurdles.

PHP is a band aid, but as a band aid it served its niche remarkably well.
Imagine if Clojure or some other better designed language would attract such
an enormous following and would be so easy to deploy.

Even today mod_php runs rings around mod_wsgi in that respect (and it's
already a lot better than mod_python).

PHP has _tons_ of shortcomings, but it is relatively good at what it does, and
that's what drives it forward, not the people behind the project. Say Python
and everybody thinks 'Guido van Rossum', say Clojure and 'Rich Hickey' jumps
to the foreground.

As long as I've been using PHP I would have a hard time coming up with the
full name of its lead developer. That 'lack of personality' and the chaotic
development process may actually contain some hidden benefit.

Absent a strong leader there will be many people pushing and pulling in
different directions. It may have gone too far, but there is a lesson in there
somewhere.

~~~
pbiggar
I'm not discussing the success of PHP. It has done well. However, if you are
suggesting that PHP has been successful _because_ of its lack of a leader, I
think you would need to justify that.

------
jrockway
One of the most common bugs in web apps is assuming that characters can just
"flow through" your application, as the article claims is a common case. Sure,
if everything is UTF-8, it might work. But the fun comes when some of your
data is us-ascii, some is iso8859-1, and some is utf-8. Now treating your data
like binary is going to result in a garbled web page. So don't do it; decode
data from octets to characters when it comes into your program, manipulate
internally as character strings, and encode characters to octets when you
output your data. Text is not binary!
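
A small Python sketch of that decode/manipulate/encode discipline, assuming each input arrives with a known declared encoding. The `decode_input` and `render` names and the sample data are hypothetical, made up for this example:

```python
def decode_input(raw: bytes, declared_encoding: str) -> str:
    """Decode octets to a character string at the program boundary."""
    return raw.decode(declared_encoding)

def render(*fragments: str) -> bytes:
    """Join character strings and encode once, on output."""
    return "".join(fragments).encode("utf-8")

# Three inputs in three different encodings.
latin1_bytes = "café".encode("iso-8859-1")
ascii_bytes = " and ".encode("us-ascii")
utf8_bytes = "naïve".encode("utf-8")

# Treating the data as binary: concatenating octets yields a mixed-encoding blob.
garbled = latin1_bytes + ascii_bytes + utf8_bytes

# Decoding at the boundary, then encoding once on the way out.
page = render(
    decode_input(latin1_bytes, "iso-8859-1"),
    decode_input(ascii_bytes, "us-ascii"),
    decode_input(utf8_bytes, "utf-8"),
)
print(page.decode("utf-8"))
```

Here `garbled` is not valid UTF-8 at all, while `page` is a single consistently encoded byte string.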

And if I were Zed Shaw, this is the part where I'd threaten to kill you if you
don't meet my demands.

~~~
randallsquared
Actually, the real problem is mixing Windows-1252 and UTF-8, while working
with tools that assume your Windows-1252 is really ISO-8859-1. ASCII, after
all, is just a subset of UTF-8, so there's no handling required for it if
you're already assuming UTF-8.
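
To make the Windows-1252 versus ISO-8859-1 mismatch concrete, here is a short Python sketch. It assumes some text with curly quotes saved as Windows-1252; the sample bytes are made up for illustration:

```python
# Curly quotes around a word, as Windows-1252 bytes.
raw = b"\x93quoted\x94"

as_cp1252 = raw.decode("cp1252")      # left and right double quotation marks
as_latin1 = raw.decode("iso-8859-1")  # invisible C1 control characters instead

def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# ASCII bytes decode identically as ASCII and as UTF-8.
ascii_raw = b"plain text"
same = ascii_raw.decode("ascii") == ascii_raw.decode("utf-8")

print(repr(as_cp1252), repr(as_latin1), decodes_as_utf8(raw), same)
```

The ISO-8859-1 decode never fails, it just silently produces the wrong characters, which is what makes this mix-up hard to notice.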

------
aarongough
_PHP was slated to gain a goto keyword_

How on earth did they decide this was a good idea? The points about getting
rid of register_globals and safe_mode are great, but why add a _feature_ to a
programming language that is highly likely just to result in lots of awful
code?

~~~
pak
There are ways to use goto effectively... in fact, the other day I was working
with an awkwardly nested try/catch that would have read much more clearly as a
goto (and run more efficiently; an empty Exception was being used to trigger
the catch).

I've seen this problem "solved" with a do while false and a break, but isn't
that even hackier and less expressive?

At this point I think goto-phobia is well understood enough that adding it for
occasional use wouldn't ruin the language.

~~~
aphyr
Ruby distinguishes between begin/rescue (analogous to Java's try/catch) and
throw/catch. I never understood it until discovering that throw/catch is
perfect for dealing with nested cases where rescue or break fail. Also, it
works through function calls and isn't interpreted as an error, which makes it
great for cases like Ramaze's redirect or render calls; unlike in Rails, they
interrupt execution immediately.

------
dasil003
I expect this pretty much kills PHP 6 in Japan.

------
IgorPartola
Is anybody interested in buying my book "Definitive guide to PHP17"?

