
\d less efficient than [0-9] - mserdarsanli
http://stackoverflow.com/questions/16621738/d-less-efficient-than-0-9
======
0x0
I wonder what kind of security vulnerabilities could be looming in validators
that don't expect non-ASCII decimal digits and use this regex?

~~~
trebor
As of PHP 5.3, PHP-powered software is safe. Using

    
    
        is_numeric('١٣٦٨') // -> false
        preg_match('/\d/', '١٣٦٨') // -> no match / false
        filter_var('١٣٦٨', FILTER_VALIDATE_INT) // -> false
    

Which I'm thankful for. I should hope that most people understand base-10 and
ASCII numbers. I don't want to have to worry about properly
validating/handling Unicode characters in number parsing.

~~~
0x0
In PHP 5.4.15, I get:

    
    
      var_dump(preg_match('/\d/u', '١٣٦٨')) -> 1
      var_dump(preg_match('/\d/', '١٣٦٨'))  -> 0
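
For comparison, Python 3's re module makes the same distinction, just with the default flipped: patterns on str match Unicode digits unless you pass re.ASCII. A minimal sketch (not from the thread):

```python
import re

# By default, \d on a str matches any Unicode decimal digit,
# like PHP's /u-flagged pattern.
print(bool(re.search(r'\d', '١٣٦٨')))            # True

# re.ASCII restricts \d to [0-9], like PHP without /u.
print(bool(re.search(r'\d', '١٣٦٨', re.ASCII)))  # False
```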

------
laumars
Regex is a really powerful tool, but sometimes I wonder just how well people
actually understand it, as the vast majority of us (myself included) seem
to be self-taught in the syntax, only learning the bits we need as and when
we need them.

The problem is, regular expression syntax is packed full of counterintuitive
idiosyncrasies which make perfect sense once they're explained, but are far
from obvious. Take this for example:

    
    
        s/(^\s+|\s+$)//g
    

is slower than running two separate regexes, like so:

    
    
        s/^\s+//;
        s/\s+$//;
    

So it does make me wonder how many bugs have been introduced into software by
bad regexes.
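
The comparison is easy to reproduce. Here is a sketch in Python rather than Perl (timings vary by engine and input, so treat the numbers as illustrative, not definitive):

```python
import re
import timeit

s = '   ' + 'x' * 1000 + '   '
combined = re.compile(r'(^\s+|\s+$)')
leading = re.compile(r'^\s+')
trailing = re.compile(r'\s+$')

def one_pass():
    # Single substitution with an anchored alternation.
    return combined.sub('', s)

def two_passes():
    # Two separate substitutions, one per end.
    return trailing.sub('', leading.sub('', s))

# Both strategies must agree before we time them.
assert one_pass() == two_passes() == 'x' * 1000

print('alternation:', timeit.timeit(one_pass, number=10_000))
print('two regexes:', timeit.timeit(two_passes, number=10_000))
```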

~~~
rhizome
Use the s/, Luke

    
    
        s/^\s+?(.*)\s+?$/$1/g

~~~
conroe64
That wouldn't work. First, it will only strip one whitespace character at the
beginning and one at the end. Second, if there is whitespace at the beginning
or the end but not both, it won't match at all. "s/^\s*(.*?)\s*$/$1/g" would
work.
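
Both failure modes are easy to check; here is a sketch in Python (the Perl patterns transliterate directly):

```python
import re

# rhizome's pattern: the lazy \s+? quantifiers settle for one
# whitespace character per end, and *require* at least one of each.
buggy = re.compile(r'^\s+?(.*)\s+?$')
print(repr(buggy.sub(r'\1', '   hi   ')))  # '  hi  ' -- one space stripped per end
print(repr(buggy.sub(r'\1', 'hi   ')))     # 'hi   ' -- no leading space, no match

# The fix: optional whitespace at the ends, lazy capture in the middle.
fixed = re.compile(r'^\s*(.*?)\s*$')
print(repr(fixed.sub(r'\1', '   hi   ')))  # 'hi'
print(repr(fixed.sub(r'\1', 'hi   ')))     # 'hi'
```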

~~~
rhizome
point being: get rid of the anchors in the alternation.

------
Aurel1us
Short answer: \d includes all the Unicode characters from
<http://www.fileformat.info/info/unicode/category/Nd/list.htm>

~~~
ars
Is that actually a good thing? If I'm using \d to validate numbers (for
example to check before string-to-int conversion, or an IP address, phone
number, or any other use), other Unicode digits are not helpful to me.

It's great to support Unicode, but I don't think \d should have been
extended this way. Add a \ud or something.

~~~
Tuna-Fish
Given that the category is specifically "decimal digit", I think it's good, so
long as the number parsing code accepts them all too.

~~~
dllthomas
Yes. Assuming that, it's good. I think that assumption is likely to be invalid
in many cases, though.

------
joosters
The quoted benchmarks all complete in fractions of a second. Not a good sign.
They _may_ be reliable results, performed accurately, but why risk it?

IMO you should be running something for much longer, to protect against short
random spurious events: a task reschedule, interrupts, etc. could add
significant variance. It wouldn't hurt to add a few more zeros to the loop
and wait a minute for the results.

~~~
stephencanon
Those events are all on the order of microseconds, far shorter than the
benchmark duration. A tenth of a second is an eternity on a modern CPU.

~~~
scott_s
One of those events, yes. But it's possible for the system to be experiencing
a bursty workload unrelated to your benchmark, and many of those events may
happen. There are also the problems of startup effects, at the high level
(the VM, which in this case is .NET), the medium level (major and minor page
faults) and the low level (caches).

My rule of thumb is that benchmarks which are supposed to be bound by the
processor and memory should last at least 60 seconds.

~~~
stephencanon
VM startup effects I can get behind as a confound; page faults and cache
effects are below millisecond level (filling the beefiest Sandybridge Xeon L3
cache you can buy from a completely cold state is on the order of 1
millisecond, and a micro benchmark like this doesn’t come close to using that
much data).

I would also note that one is sometimes in the position of needing to measure
performance of a compute-intensive task that is latency-critical but will not
be running constantly; in such a scenario, using long-running benchmarks can
be misleading because the processor will become thermally constrained and drop
in and out of lower voltage/frequency bands, further confounding measurements.

I agree with you that a tenth of a second is on the shorter side of what I
would like to see in such a benchmark, but I don’t think the situation is as
dire as your first post suggested; unless the system is _exceptionally_ noisy,
the measurements seem to be valid, despite the relatively short duration. 60
seconds is overkill for a simple task like this.

~~~
scott_s
Again, it's _repeated_ page faults and cache effects.

When you run experiments, you want to draw conclusions. To have confidence in
your conclusions, you want to eliminate as many variables as possible. In my
work, I set the time of the benchmark high enough that I am confident that it
is very unlikely for these effects to have a significant influence on the
results. When you're drawing conclusions and publishing the results that will
be scrutinized by peers, "overkill" is the way to go.

Also note that I was not the first poster on this subject.

~~~
stephencanon
You cannot eliminate confounds by simple over-measurement. “Overkill” provides
false confidence.

The only way to eliminate confounds is to understand them, and either control
for them or bound them to an acceptable error tolerance. For a simple
benchmark such as this, cache misses and page faults reach steady state within
the first millisecond of operation; the error they contribute to the
measurement of a .1s benchmark (even in aggregate) is no more than 1% — almost
surely acceptable.

I have no experience with .Net, and would not care to make any estimates on
the contribution of VM startup time, but the experiment in question does not
include the VM startup in the measurement.

If a system were so noisy as to have interrupt storms on the order of .1s,
then I would not be comfortable with timings that run for 60s either. I would
much rather have statistics on 100 measurements of .1s each, which would make
clear the impact of such anomalies (while still being faster to gather). There
are many events that can make such measurements slower, but almost none that
can make them faster; the distribution of the measurements is typically well-
modeled by a Poisson distribution with bias. If one is actually trying to
eliminate the effect of those events from the measurement, taking the minimum
over many short samples is actually much closer to the truth than averaging
over one long sample. If instead one is trying to include the effect of such
events, then a different statistic would be in order.
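
This "minimum over many short samples" approach maps directly onto Python's timeit.repeat, for what it's worth. A sketch:

```python
import re
import timeit

pattern = re.compile(r'\d')

# Many short trials instead of one long run: interrupts and scheduler
# noise can only make a sample slower, so the minimum is the best
# estimate of the noise-free cost.
samples = timeit.repeat(lambda: pattern.search('abcdef123'),
                        number=10_000, repeat=50)
print('best of 50:', min(samples))
```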

------
Erwin
Python's methods on Unicode strings also apply this logic. E.g.:

    
    
        >>> u'١٣٦٨'.isdigit()
        True
        >>> int(u'١٣٦٨')
        1368

I suppose this could potentially be abused if you are storing and displaying
what is supposed to be used as a number as Unicode text, but later convert it
to a number. E.g. an online shop where you are asked whether you want to pay
'5꯸' for some item, which looks like 5 plus some weird square, but is really
int(u'5꯸') => 58 --
<http://www.fileformat.info/info/unicode/char/abf8/index.htm>
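
A defensive parser for that scenario simply refuses anything outside [0-9] before calling int(). A sketch (the helper name is mine, and it only handles unsigned integers):

```python
def parse_ascii_int(s: str) -> int:
    # int() itself happily accepts any Unicode decimal digits,
    # e.g. int('١٣٦٨') == 1368, so filter before converting.
    if not (s.isascii() and s.isdigit()):
        raise ValueError(f'not an ASCII integer: {s!r}')
    return int(s)

print(parse_ascii_int('58'))  # 58
try:
    parse_ascii_int('١٣٦٨')
except ValueError as e:
    print(e)
```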

------
foobar__
The fact that character ranges like [a-z] can depend on the value of
LC_COLLATE is also something not many people are aware of.

    
    
      $ echo "ä" | LC_COLLATE=C grep '[a-z]'
      $ echo "ä" | LC_COLLATE=en_US.UTF-8 grep '[a-z]'
      ä
    

For common values of LC_COLLATE, the range [a-z] does not exclude accented
characters and umlauts.

------
rwmj
I was a bit surprised that Perl does _not_ seem to be matching Unicode digits.
Anyone know why?

    
    
        $ echo '0' | perl -pe 'print "yes: " if m/\d/'
        yes: 0
        $ echo '੧' | perl -pe 'print "yes: " if m/\d/'
        ੧

~~~
xonea
You have to tell Perl to expect UTF-8 on stdin (switch -C).

    
    
      $ echo '੧' | perl -C -pe 'print "yes: " if m/\d/'

and

      $ perl -e 'use utf8; print "yes\n" if "੧" =~ m/\d/;'

both work :)

~~~
pooriaazimi
`man perlunicode` is chock-full of UTF-8-related stuff (and it's _looong_ ):
<http://perldoc.perl.org/5.14.0/perlunicode.html>

------
fleitz
The test code creates a new regex every time; it would be interesting to see
how it performs with a compiled and reused regex.

~~~
joosters
It compiles it once and then matches it against 10000 strings.

------
jnotarstefano
There seems to be a tiny bit of difference in Ruby too. This code:

    
    
        require 'benchmark'
    
        def random_string(length)
          result = (1..length).map { (65+rand(26)).chr }.join
          result[rand(length)] = rand(10).to_s if rand > 0.5
          result
        end
    
        Benchmark.bmbm do |b|
          b.report("\\d") do 
            (1..1000).count { random_string(1000).match(/\d/) }     
          end
    
          b.report("[0-9]") do 
            (1..1000).count { random_string(1000).match(/[0-9]/) }
          end
    
          b.report("[0123456789]") do 
            (1..1000).count { random_string(1000).match(/[0123456789]/) }
          end
        end
    

gives:

    
    
        ~/Code/ruby% ruby regex.rb
        Rehearsal ------------------------------------------------
        \d             0.690000   0.000000   0.690000 (  0.712500)
        [0-9]          0.690000   0.000000   0.690000 (  0.703990)
        [0123456789]   0.680000   0.010000   0.690000 (  0.705759)
        --------------------------------------- total: 2.070000sec
        
                           user     system      total        real
        \d             0.710000   0.000000   0.710000 (  0.791722)
        [0-9]          0.700000   0.000000   0.700000 (  0.708210)
        [0123456789]   0.690000   0.010000   0.700000 (  0.713355)

------
justanotherbody
Noted here that modifying \d to include only [0-9] makes \d the more
efficient of the two: <http://stackoverflow.com/a/16622773/1943429>

------
ams6110
I tend to use ranges (e.g. [0-9]) as they seem to me to be more standard than
the token for "any digit" (often \d, but in Elisp (Emacs) it's [[:digit:]]).

------
belper
Interesting to see these missing, both of which mean 1: [一, 壹]

~~~
anonymous
I think that's because they are treated as words rather than digits, the same
way that \d won't match "eins".
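
Right: in Unicode terms the CJK numerals are letters, not decimal digits, which is easy to confirm with Python's unicodedata (an aside, not from the thread):

```python
import unicodedata

# 一 (CJK "one") is categorised as a letter (Lo), while the
# Arabic-Indic digit ١ is a decimal digit (Nd) -- and \d only matches Nd.
print(unicodedata.category('一'))  # 'Lo'
print(unicodedata.category('١'))   # 'Nd'
```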

------
hdragomir
in C#

~~~
hdragomir
Here are some results in Javascript: <http://jsperf.com/digit-regex>

------
conchulio
Maybe the order in which the regexes are evaluated is also important, due to
caching etc. Has anyone tested whether the results differ when the order is
changed?

------
dbbolton
What is the need for those "mathematical monospace/bold/sans" characters?
Shouldn't that be a font issue?

~~~
claudius
Fonts are about different representations of the same symbols.
Monospace/bold/sans are _different_ symbols in mathematics.
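
And since they are distinct code points, the mathematical digit variants are themselves full members of category Nd, so \d matches them and int() converts them. A quick check in Python:

```python
import re
import unicodedata

bold_one = '\U0001D7CF'  # MATHEMATICAL BOLD DIGIT ONE

print(unicodedata.category(bold_one))    # 'Nd' -- a real decimal digit
print(bool(re.search(r'\d', bold_one)))  # True
print(int(bold_one))                     # 1
```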

