
Regexp Ranges and Locales: A Long Sad Story - js2
https://www.gnu.org/software/gawk/manual/html_node/Ranges-and-Locales.html
======
kijin
Leaving the behavior undefined in almost all commonly used locales (i.e.
anything involving UTF-8) doesn't seem to be a particularly helpful standard.

It's unreasonable to expect someone who writes a regexp to anticipate which
locales a user will execute his program in. It's just as unreasonable to tell
him to stick to an outdated locale. How about we ignore locales altogether and
just use code points? Code point order is the only ordering that every locale
can agree on. [a-z] should match any character whose code point is between
U+0061 and U+007A regardless of the locale.
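
A minimal sketch of what code-point ranges look like in practice, using Python's `re` module, which for `str` patterns compares range endpoints by code point. The helper name `in_code_point_range` is made up for illustration.

```python
import re

# For str patterns, [a-z] covers exactly the code points U+0061..U+007A.
LOWER_ASCII = re.compile(r"[a-z]")


def in_code_point_range(ch: str, lo: str = "a", hi: str = "z") -> bool:
    """Return True if the single character ch lies between lo and hi by code point."""
    return ord(lo) <= ord(ch) <= ord(hi)


# Compare the hand-written check with what the regex engine does.
for ch in ["a", "z", "A", "ä"]:
    print(ch, in_code_point_range(ch), LOWER_ASCII.fullmatch(ch) is not None)
```

Under this rule "ä" (U+00E4) is simply outside the range, whatever the locale says about where it sorts.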

~~~
dwheeler
I agree that it is terrible that this is undefined, but utf-8 is not a Locale.
That is an encoding. The standard works just fine when you use C as the Locale
and utf-8 as the encoding. I think they should have just defined ranges as
being the encoding values, because that would make more sense, but that would
be exactly the opposite of what the standard previously said.

~~~
kijin
> utf-8 is not a Locale.

Of course it isn't. What I suggested above is to use neither locales nor
UTF-8, but code points. "a" = U+0061 no matter which locale you're in, and no
matter which encoding you use. Every locale and every encoding is based on the
same universal mapping of characters to code points.

~~~
oldmanhorton
Sure, but those code points are arbitrary. For instance, in German you may
want a-z to include umlauted vowels, or you may not. That's a
locale-specific setting, even though the umlauted characters fall well outside
the ASCII range a to z.

~~~
mmt
Keeping ranges undefined doesn't satisfy these wants, either.

Using code points would at least allow for ranges to have the possibility of
being usable to someone in a standard, predictable fashion, outside of the
C/POSIX locale.

For example, specifying a-z plus each umlauted vowel is still shorter than
specifying all letters individually.

Perhaps there is some wisdom in William S. Burroughs's "If you can't be just,
be arbitrary."

------
tzs
I'm sure POSIX thought about this a lot more than I have, so I'm probably
missing something and am about to say something that is actually stupid,
but...

It should overload items of the form "x-y" in ranges where x and y are single
characters in the locale in use. It should define specific items of this form
as not being ranges but simply shorthand for certain predefined strings. The
expression is treated as if those items were replaced by the corresponding
predefined strings before the regular expression was parsed.

In particular "a-z" => "abcdefghijklmnopqrstuvwxyz". Similar for uppercase.
Include such a definition to produce each possible substring of length 2 or
more of "abcdefghijklmnopqrstuvwxyz". Similar for "0-9".

~~~
raverbashing
I'm not sure what you're saying, when would that be different from what we
have today?

I don't see why that would be beneficial. You also might want ranges as
[a-fk-z] (for example)

~~~
tzs
> I'm not sure what you're saying, when would that be different from what we
> have today?

The situation today is that "[a-z]" is undefined by POSIX if you are not in
the POSIX locale. I'm suggesting that it, and similar cases, should be made
defined in all locales that include a, b, c, d, e, f, g, h, i, j, k, l, m, n,
o, p, q, r, s, t, u, v, w, x, y, and z.

> You also might want ranges as [a-fk-z] (for example)

Looking back, I wrote unclearly. Where I wrote 'overload items of the form
"x-y" in ranges' it would have been better to write 'overload items of the
form "x-y" inside bracket expressions'.

In "[a-fk-z]" there are two items of that form. Under my suggestions, "a-f"
would be replaced with "abcdef" in all locales, and "k-z" would be replaced
with "klmnopqrstuvwxyz", giving "[abcdefklmnopqrstuvwxyz]".
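
A rough sketch of that rewriting step, assuming the only predefined alphabets are ASCII lowercase, ASCII uppercase, and the digits. The names `expand_item` and `expand_bracket` are hypothetical, and the sketch ignores escapes, `[:class:]` items, and a leading `^` or `]`.

```python
import re
import string

# Only runs inside these three alphabets are expanded, per the proposal above.
ALPHABETS = (string.ascii_lowercase, string.ascii_uppercase, string.digits)


def expand_item(lo: str, hi: str) -> str:
    """Expand "lo-hi" to an explicit string if both ends sit in one alphabet."""
    for alphabet in ALPHABETS:
        i, j = alphabet.find(lo), alphabet.find(hi)
        if i != -1 and j != -1 and i < j:
            return alphabet[i : j + 1]
    return f"{lo}-{hi}"  # not a predefined item: leave it untouched


def expand_bracket(body: str) -> str:
    """Rewrite the inside of a bracket expression, e.g. "a-fk-z"."""
    return re.sub(r"(.)-(.)", lambda m: expand_item(m.group(1), m.group(2)), body)


print(expand_bracket("a-fk-z"))
```

Anything that is not one of the predefined items, such as "a-Z", is passed through unchanged and would keep whatever meaning the locale gives it.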

~~~
bonzini
But would a-z also include for example à and ä, or should 0-9 include ½?

(The solution that glibc will implement is to un-interleave lowercase and
uppercase characters whenever the collation order is like aàAÀbBcC...).

------
rwmj
Related:

[https://sourceware.org/bugzilla/show_bug.cgi?id=23393](https://sourceware.org/bugzilla/show_bug.cgi?id=23393)
([https://news.ycombinator.com/item?id=17557243](https://news.ycombinator.com/item?id=17557243))

------
chrismorgan
> _‘[ "-/]’ is perfectly valid in ASCII, but is not valid in many Unicode
> locales, such as en_US.UTF-8._

Why is this the case? Collation sequences, I’m guessing?
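
For reference, the ASCII half of the quoted claim can be checked by code point. This is a small sketch in Python, whose `re` module orders range endpoints by code point for `str` patterns; it says nothing about what en_US.UTF-8 collation does.

```python
import re

QUOTE, SLASH = '"', "/"

# Sorting plain Python strings compares code points, with no locale involved.
by_code_point = sorted([SLASH, QUOTE])

# U+0022 < U+002F, so the range "-/ has its endpoints in ascending order here.
PUNCT = re.compile(r'[ "-/]')


def matches(ch: str) -> bool:
    """True if ch is a space or lies between U+0022 and U+002F."""
    return PUNCT.fullmatch(ch) is not None


print(by_code_point, matches("*"), matches("0"))
```

If a locale's collation order puts the slash before the double quote, the same range has its endpoints reversed, which would explain why it is rejected there.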

~~~
jwilk
Yes, slash sorts before double-quote in this locale:

      $ (echo '"'; echo '/') | LC_ALL=en_US.UTF-8 sort
      /
      "

~~~
a1369209993
Edit: never mind, I apparently rm'd the offending locale directory last time I
encountered a bug like this and sort is silently ignoring LC_ALL:

      bash: warning: setlocale: LC_ALL: cannot change locale (en_US.UTF-8): No such file or directory

I can't reproduce this bug:

      $ (echo '"'; echo '/') | LC_ALL=en_US.UTF-8 sort
      "
      /
      $ sort --version
      sort (GNU coreutils) 8.26

What version are you using?

------
rspeer
I've seen something that sounds related. In grep, in the en_US.UTF-8 locale,
sometimes I can match [A-Z]+ and it will match accented uppercase strings such
as "SCHÖN". It will not match lowercase letters.

This is often desirable, except for the part that I don't know what the heck
ranges mean anymore. "Ö" is certainly not between "A" and "Z" in codepoint
order. It is in collation order, but if it were collation order, it would
match lowercase letters. How does this work?

~~~
bonzini
Collation order did not interleave lowercase and uppercase until recently,
except in a few oddball locales (e.g. cs_CZ.UTF-8).

Interleaving was added to all locales recently, and people started complaining
that their scripts broke, so it will probably be reverted.
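
One way to picture the difference is with two toy collation sequences. Both orders below are assumptions made up for illustration, not real glibc locale data, and `collation_range` is a hypothetical helper.

```python
import string

UPPER, LOWER = string.ascii_uppercase, string.ascii_lowercase

# Toy order 1 (assumed): an uppercase block with Ö filed right after O,
# followed by a lowercase block with ö filed right after o.
GROUPED = UPPER.replace("O", "OÖ") + LOWER.replace("o", "oö")

# Toy order 2 (assumed): lowercase and uppercase interleaved, aAbBcC...zZ.
INTERLEAVED = "".join(lo + up for lo, up in zip(LOWER, UPPER))


def collation_range(order: str, lo: str, hi: str) -> str:
    """Characters from lo to hi inclusive in the given collation sequence."""
    return order[order.index(lo) : order.index(hi) + 1]


print(collation_range(GROUPED, "A", "Z"))
print(collation_range(INTERLEAVED, "a", "z"))
```

In the grouped order, A-Z picks up Ö but no lowercase letters. In the interleaved order, a-z picks up every uppercase letter except Z.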

------
gnufx
People seem to be missing character classes like [[:upper:]]. If you need some
other sort of range in a portable script, say, just make sure you set
LC_COLLATE. And if you're testing with GNU sort, use --debug to check what
it's actually doing in case you don't have the definition for the current
LC_COLLATE, for instance.
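
The same idea sketched in Python rather than shell, assuming you pin LC_COLLATE to "C" inside the program. The helper names are hypothetical, and `str.isupper` follows Unicode properties, so it is only a rough stand-in for [[:upper:]].

```python
import locale

# "C" is always available, unlike en_US.UTF-8, which may not be installed.
locale.setlocale(locale.LC_COLLATE, "C")


def collate_sorted(items):
    """Sort strings using the collation order of the current LC_COLLATE."""
    return sorted(items, key=locale.strxfrm)


def is_upper(ch: str) -> bool:
    """Rough analogue of [[:upper:]] for one character, via Unicode properties."""
    return len(ch) == 1 and ch.isupper()


print(collate_sorted(["/", '"']), is_upper("Ö"), is_upper("z"))
```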

------
theothermkn
My main problem with the [A-Za-z0-9] notation is that it looks great at first
glance. I mean, it looks really, really great. _Of course_ [A-Z] means all
capital letters. And then you think about it, and you start to suspect
something like the situation described in the article. Suddenly, you're in the
familiar but dizzying position of being perched atop a shoddily-built and
wobbling tower of abstractions. You feel your familiar nausea soaking in from
the periphery of your editor window.

I _just_ now, minutes ago, got some regexes to mostly sorta work in a project
to convert some jai alai score data (converted from wonky PDF files). It's a
one-off script. My Python is rusty, but I couldn't find how to get a POSIX
character class for 'lowercase letters and uppercase letters' to work.
[A-Za-z] happens to work, for now.
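
For what it's worth, the standard-library `re` module has no POSIX classes such as [[:alpha:]], so a sketch of the two usual substitutes might look like this. The function names are made up, and the sample string is only an illustration.

```python
import re
from itertools import groupby

# Code-point ranges: ASCII letters only, independent of the locale.
ASCII_LETTERS = re.compile(r"[A-Za-z]+")


def ascii_letter_runs(text: str) -> list:
    """Runs of ASCII letters, as matched by [A-Za-z]+."""
    return ASCII_LETTERS.findall(text)


def unicode_letter_runs(text: str) -> list:
    """Runs of characters that str.isalpha() counts as letters."""
    return ["".join(run) for is_letter, run in groupby(text, key=str.isalpha) if is_letter]


sample = "SCHÖN abc 42"
print(ascii_letter_runs(sample))
print(unicode_letter_runs(sample))
```

The first version splits accented words apart. The second keeps them whole but accepts letters from every script, which may or may not be what a one-off script wants.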

I _love_ regexes. I _hate_ regexes.

~~~
Sharlin
Every (Finnish) first grader knows that [A-Z] _obviously_ doesn't mean all
capital letters, because the letters Å, Ä, and Ö follow Z in the alphabet ;)

------
jwilk
> the 2008 standard had changed the definition of ranges, such that outside
> the "C" and "POSIX" locales, the meaning of range expressions was
> _undefined_

It was changed earlier. It's undefined in the 2004 edition, too.

~~~
tzs
And no change in the 2017 edition.

------
bjourne
Aren't there already half a dozen Linus Torvalds rants on the brain-damaged
stupidity of the POSIX specs? The story is long and _sad_ because the
developers thought following the spec was more important than writing usable
software.

