
Glibc: [0-9] matches ¼ ١ ２ 〣 and others, but not ９ (and other nines) - rwmj
https://sourceware.org/bugzilla/show_bug.cgi?id=23393
======
js2
It may not be clear till you read all the comments on the bug that the ９ which
isn't matched is a FULLWIDTH DIGIT NINE.

If you're using, e.g., [a-fA-F0-9] in any locale other than C/POSIX, you're
going to have a bad time.

This is going to cause problems, and it's a shame the developers note this,
but then seem to wash their hands of it. Perhaps glibc should do what the
programmer means, not what they say. It's obvious the regex above intends to
validate hex digits, so glibc should do that, standards (and locale) be
damned, unless the programmer explicitly opts in to the technically correct
behavior. The documentation even says:

> Therefore, using [a-z] does not make much sense except in the C/POSIX
> locale.

If that range doesn't make sense except in the C/POSIX locale, then why
interpret it in any other locale? Come on glibc... help us out here.

$0.02.

~~~
zeroimpl
This suggests the only safe portable way to represent hex is one of these:
[1234567890abcdefABCDEF] or [\dabcdefABCDEF]? That sucks. Fortunately I don't
use C, but I'm worried about this finding its way into other languages...

~~~
js2
For a POSIX regex, the correct way to match a hex digit is to use
[[:xdigit:]].

~~~
zeroimpl
Right - would be great if more languages supported that, specifically
Javascript.

~~~
js2
Javascript defines regex ranges in terms of UTF-16 code units w/no
consideration for locale, so you can use [a-fA-F0-9] and it will work as
expected.

The issue discussed in this bug is only relevant to POSIX regular expression
ranges.

Perhaps you're lamenting that there are so many different flavors of regular
expression. I agree. Just the other day I had to give up on using a regex in a
CloudFormation template to validate input because I could not get it to work
as documented.

------
cpburns2009
This is bizarre. I thought _[0-9]_ was supposed to match only 0-9 (the 10
digits), while _\d_ was meant to match all digits including the various
Unicode variations.

~~~
jwilk
This is true for many regular expression dialects.

But this bug is about POSIX regular expressions. POSIX doesn't define \d, and
it defines ranges such as [0-9] only in the POSIX locale.

Historically, glibc interpreted such ranges in terms of the locale's
collation order. For example, in the Estonian locale, [a-z] doesn't include
the letters t, u, v, w, x, y, because that's how the Estonian alphabet works:
[https://en.wikipedia.org/wiki/Estonian_orthography#Alphabet](https://en.wikipedia.org/wiki/Estonian_orthography#Alphabet)

These (somewhat surprising) semantics of character ranges are not new, but
recent changes in glibc made them more spectacular.

------
emmelaich
Reminds me of Ruby's regexp behaviour at one point.

With ignorecase, !\W didn't match 'k' and 's' (only!)

[https://bugs.ruby-lang.org/issues/4044](https://bugs.ruby-lang.org/issues/4044)

------
usr1106
So where are glibc regexps used in the most typical practical cases?

At least grep does not use them (at least not in unmodified form) as a comment
in the bug report notes.

I think changing such behavior is unacceptable, regardless of what the spec
says. Here I would propose adopting the API stability promise of the Linux
kernel: we don't break existing programs.

If somebody notes that the previous behavior is not correct according to the
spec, a new POSIXly correct mode can be introduced. But it should not
magically become the default.

------
detaro
Does this "pass through" to the regex implementations in any other languages,
or do those tend to implement their own parser or integrate different
libraries?

~~~
js2
Few languages use glibc for their regex implementation. PCRE is probably the
most commonly used third-party regex lib, but it depends. Python and a bunch
of languages have their own implementations.

------
zlynx
I guess we should all just force-set our C programs to the C locale.

~~~
patrec
This works till you want at least minimal UTF-8 support, such as when you're
working with actual English text as opposed to human-readable computerese.

~~~
kazinator
Sticking to the "C" locale will work even if you want UTF-8 support. You just
roll your own UTF-8 support.

I built the TXR language entirely without any of the harmful garbage that is
the ISO C/POSIX localization. It handles UTF-8 just fine.

The C localization stuff was developed too early, at a time when nobody had
any real experience with localization. Before Unicode, before the Internet.

Before _threads_! How do you set it up so that one thread runs in one
locale and another in another? That's important if a global server is
servicing two different requests simultaneously from users in two different
locales. The idiotic C locale stuff relies on global variables.

You set some magic variables and, poof, numerous functions in your entire
image change their behavior, whether they are working with internationalized
data or not.

The setlocale function might as well be called fuck_my_program_please.

