
Glob Matching Can Be Simple and Fast Too - secure
https://research.swtch.com/glob
======
js2
_I have not looked at the other linear-time implementations to see what they
do, but I expect they all use one of these two approaches._

Python's glob() (fnmatch really) translates the glob to a regular expression
then uses its re library:

[https://github.com/python-git/python/blob/715a6e5035bb21ac49...](https://github.com/python-git/python/blob/715a6e5035bb21ac49382772076ec4c630d6e960/Lib/fnmatch.py#L72)
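
A minimal sketch of that translation, using the standard library's `fnmatch.translate`, which returns the regular expression source that fnmatch hands to `re`. The sample pattern and file names are made up for illustration, and the exact regex text differs between Python versions, so the sketch only checks behavior:

```python
import fnmatch
import re

# fnmatch.translate() returns the regex source that fnmatch builds for a glob.
pattern = "a*b?.txt"  # illustrative pattern, not from the article
regex_source = fnmatch.translate(pattern)
print(regex_source)  # exact text varies between Python versions

# The translated regex is anchored, so match() tests the whole name.
compiled = re.compile(regex_source)
matches = compiled.match("aXXbY.txt") is not None  # '?' consumes the 'Y'
rejects = compiled.match("aXXb.txt") is None       # nothing left for '?'
print(matches, rejects)
```

Because the glob ends up as an ordinary `re` pattern, its running time is whatever the `re` engine gives it.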

~~~
rsc
Thanks for pointing this out. I wrote a Python test program but forgot to run
that test in the data collection for the graphs. I will update them once the
tests finish running. Ironically, Python must be passing the glob to an
exponential-time regexp library, since Python is on the slow side of the
fence.

~~~
willvarfar
Russ, while you're around, thanks for re2! RE2::Set especially! But why is
RE2::Set so underdocumented and underpromoted? All regex libraries need this
functionality.

~~~
rsc
You're welcome. I think basically all of RE2 is underdocumented and
underpromoted. Why should RE2::Set be any different?

Seriously, though, it was a bit of an afterthought. A team at Google was |ing
together a ton of regexps and came to me for something better, so I wrote
RE2::Set. I'm glad it helps others too.

------
dexen
Previously: "Regular Expression Matching Can Be Simple And Fast" (2007)
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)
The paper deals with the "Thompson NFA" approach to regex matching, which has
low computational complexity.

Russ's other papers on regular expression matching:
[https://swtch.com/~rsc/regexp/](https://swtch.com/~rsc/regexp/)

~~~
Terr_
It's funny, I had an immediate "I've seen these graphs before" reaction with
this article, and it turns out your first link is something I was working on
last month for a leetcode question.

It was probably one of the more useful/understandable pages I found in that
process.

------
FreeFull
Interesting how the Rust implementation of glob currently seems to be the
slowest out of the linear time implementations. I guess maybe not too much
optimisation effort was put into it?

~~~
kibwen
There's no implementation of globbing in Rust's standard library, so we'd need
more information regarding which implementation was used for the blog post.
I'm guessing one of two possibilities:

1. The glob crate, which is stdlib ejecta from pre-1.0, and appears to be in
maintenance mode (it hasn't seen a "real" commit since Jan 2016).

2. The regex crate, which is very actively developed (and in fact inspired by
rsc's writings and Go's regex implementation).

If it's the former, then indeed it's probably as uninteresting as "nobody has
bothered to ever benchmark this crate". But if it's the latter then I bet
burntsushi would be very interested!

~~~
zokier
At a glance, I don't see any support for globbing in the regex crate.

~~~
adito
I guess it's the one that's part of ripgrep,

[https://github.com/BurntSushi/ripgrep/tree/master/globset](https://github.com/BurntSushi/ripgrep/tree/master/globset)

> This crate implements globs by converting them to regular expressions, and
> executing them with the regex crate.

------
avar
There's another way for glob() implementations to mitigate these sorts of
patterns that Russ doesn't discuss, but can be inferred from a careful reading
of the different examples in this new glob() article & the 2007 regex article.

In the regex article he notes that e.g. perl is subject to pathological
behavior when you match a?^na^n against an a^n:

    
    
        $ time perl -wE 'my $l = shift; my $str = "a" x $l; my $rx = "a?" x $l . $str; $str =~ /${rx}/' 28
        real    0m13.278s
    

However changing the pattern to /${rx}b/ makes it execute almost instantly.
This is because the matcher will look ahead for fixed non-pattern strings
found in the pattern, and deduce that whatever globbing we're trying to match
now it can't possibly matter if the string doesn't have a "b" in it.

I wonder if any globbing implementations take advantage of that class of
optimization, and if there are any cases where Russ's suggested solution of not
backtracking produces different results than you'd get by backtracking, in
particular with some of the extended non-POSIX glob syntax out there.
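
A rough sketch of how that kind of literal look-ahead could be applied to globs, assuming patterns that use only `*` and `?` (no bracket classes or escapes). The helper names `required_literals` and `may_match` are hypothetical, and this is not how perl or any particular glob library does it:

```python
import re

def required_literals(glob_pattern):
    """Split a glob on its wildcards and keep the non-empty literal runs."""
    return [run for run in re.split(r"[*?]", glob_pattern) if run]

def may_match(glob_pattern, name):
    """Cheap necessary condition: every literal run must occur in name.

    False means the name cannot match. True only means a real matcher
    still has to be run.
    """
    return all(run in name for run in required_literals(glob_pattern))

# The pathological case from the article fails the check immediately,
# because the name contains no "b" at all.
print(may_match("a*a*a*b", "a" * 100))
```

The check is only a filter: it never proves a match, it just lets a slow matcher skip names that cannot possibly match.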

~~~
rurban
That's not globbing but using the regex matcher. perl's glob does the same as
PHP, it simply calls the system glob. So I'm curious why the graph displays it
as exponential on linux, where it should be linear.

In your example pcre2 performs much better than perl btw: it errors with match
limit exceeded (-47), while perl happily burns exponential CPU. It's now even
worse than before Russ' original perl article. Now it descends into heap
memory, before only into the stack. So now it will keep crunching forever on
the whole heap, while before the perl 5.10 rewrite triggered by Russ it died
fast on stack overflow.

~~~
rsc
Empirically, Perl's glob does not call the system glob. On my Linux system
Perl (5.18.2) is slow but the system glob is fast. Create a file named
/tmp/glob/$(perl -e 'print "a"x100') in an otherwise empty /tmp/glob and then
try:

    
    
      $ cat tglob.pl
      #!/usr/bin/perl
      
      use Time::HiRes qw(clock_gettime CLOCK_REALTIME);
      
      $| = 1;
      chdir "/tmp/glob" or die "$!";
      for($i=0; $i<9; $i++) {
          $pattern = ("a*"x$i) . "b";
          $t = clock_gettime(CLOCK_REALTIME);
          $mul = 10;
          if($i >= 5){ 
              $mul = 1;
          }
          for($j=0; $j<$mul; $j++) {
              glob $pattern;
          }
          $t1 = clock_gettime(CLOCK_REALTIME);
          printf("%d %.9f\n", $i, ($t1-$t)/$mul);
      }
      $ perl -v
      
      This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
      ...
      $ perl tglob.pl
      0 0.000004911
      1 0.000016212
      2 0.000088072
      3 0.002416682
      4 0.030226517
      5 0.452545881
      6 6.872966528
      ^C
    

You're the second person to claim that Perl calls the system glob though
(someone in my blog comments did too). Maybe different versions of Perl do
different things? This is Ubuntu 14.04 if that matters.

~~~
demerphq
No you are right. Modern perls use the BSD code in File::Glob.

~~~
rsc
More discussion at
[https://research.swtch.com/glob#comment-3272455315](https://research.swtch.com/glob#comment-3272455315)

------
eriknstr
OP, what version(s) of the BSD libc did you test? What OS, and which version of
the OS?

macOS only? NetBSD? FreeBSD? OpenBSD?

If you tested on FreeBSD, please file a bug at
[https://bugs.freebsd.org/bugzilla/enter_bug.cgi?product=Base...](https://bugs.freebsd.org/bugzilla/enter_bug.cgi?product=Base%20System)

I'm not a project member but I'm a user of the system so it's in my interest
that issues like this are resolved.

Please let me know whether or not you file a bug so that if you do I don't
duplicate bug reports and if you don't I can do some benchmarking myself.

~~~
rsc
I copied the glob implementation from one of the BSDs - I believe FreeBSD -
into a standalone C program and ran that program on the same Linux system as
the rest of the tests. Here's the version that tests the system glob. If you
run it on your FooBSD systems you can see whether it runs quickly or not. The
program assumes that you've already done:

    
    
      rm -rf /tmp/glob
      mkdir /tmp/glob
      cd /tmp/glob
      touch $(perl -e 'print "a"x100')
    

And here's the program:

    
    
      #include <stdio.h>
      #include <glob.h>
      #include <unistd.h>
      #include <string.h>
      #include <stdlib.h>
      #include <dirent.h>
      #include <time.h>
      
      int
      main(void)
      {
          glob_t g;
          char pattern[1000], *p;
          struct timespec t, t2;
          double dt;
          int i, j, k;
      
          chdir("/tmp/glob");
          setlinebuf(stdout);
          
          int mul = 1000;
          for(i = 0; i < 100; i++) {
              p = pattern;
              for (k = 0; k < i; k++) {
                  *p++ = 'a';
                  *p++ = '*';
              }
              *p++ = 'b';
              *p = '\0';
              printf("# %d %s\n", i, pattern);
              clock_gettime(CLOCK_REALTIME, &t);
              for (j = 0; j < mul; j++) {
                  memset(&g, 0, sizeof g);
                  if(glob(pattern, 0, 0, &g) != GLOB_NOMATCH) {
                      fprintf(stderr, "error: found matches\n");
                      exit(2);
                  }
                  globfree(&g);
              }
              clock_gettime(CLOCK_REALTIME, &t2);
              t2.tv_sec -= t.tv_sec;
              t2.tv_nsec -= t.tv_nsec;
              dt = t2.tv_sec + (double)t2.tv_nsec/1e9;
              printf("%d %.9f\n", i, dt/mul);
              fflush(stdout);
              if(dt/mul >= 0.0001)
                  mul = 1;
              if(i >= 8 && dt/mul >= 10)
                  break;
          }
      }
    

I won't be filing any specific bugs myself. I mailed oss-security@ this
morning, which should reach relevant BSD maintainers, but more bug filing
can't hurt.

------
avar
Slightly off-topic, but anyone know what he's using to generate those inline
SVG graphs? I've been looking for some easy to use graphing library like that
to present similar performance numbers on a webpage.

~~~
rsc
I looked around but didn't find anything I liked. The regexp article [1] uses
jgraph [2] to plot the data as eps, then ghostscript to turn eps to png. That
no longer works in a world of variable-dpi screens, so for this article I
generated SVGs and inlined them into the HTML. The SVGs are generated by a
custom program I wrote for the job [3], to mimic the old jgraph graphs. It's
not pretty code, but it gave me a lot of control over the presentation and
produced decent results. You're welcome to adapt it if you like.

[1]
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

[2]
[https://web.eecs.utk.edu/~plank/plank/jgraph/jgraph.html](https://web.eecs.utk.edu/~plank/plank/jgraph/jgraph.html)

[3]
[https://research.swtch.com/globgraph.go](https://research.swtch.com/globgraph.go)

------
lexpar
Not sure if OP is author, but if you are, just to inform you, there is a small
typo in this paragraph:

"Unfortunately, none of _tehse_ protections address the cost of matching a
single path element of a single file name. In 2005, CVE-2005-0256 was issued
for a DoS vulnerability in WU-FTPD 2.6.2, because it ran for a very long time
finding even a single match during:"

Very informative article. Thanks for it!

~~~
rsc
Thanks, fixed the typo (author, not OP).

------
tyingq
The bsd derived glob has other functionality that I assume isn't simple or
fast:

    
    
      perl -MFile::Glob=bsd_glob -e 'print bsd_glob("{{a,b,c}{1,2,3}{{yuck,Yuck},{urgh,URGH}}}\n")'
    

Produces 36 lines representing all the iterations. Nest a bit deeper and it
gets unwieldy.

~~~
loeg
Those are actually expanded (in the recursive fashion you would expect) before
any "star" matching is done.
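
A small sketch of that kind of recursive brace expansion, to show where the 36 comes from (3 x 3 x 4 combinations). The function `expand_braces` is a simplified illustration, not the BSD code, and it ignores quirks such as escapes or special handling of `{}`:

```python
def expand_braces(pattern):
    """Expand {a,b} alternatives recursively; returns a list of strings."""
    start = pattern.find("{")
    if start == -1:
        return [pattern]
    depth = 0
    piece_start = start + 1
    alternatives = []
    for i in range(start, len(pattern)):
        ch = pattern[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                # Found the brace that closes the one at `start`.
                alternatives.append(pattern[piece_start:i])
                prefix, suffix = pattern[:start], pattern[i + 1:]
                results = []
                for alt in alternatives:
                    # Nested and later braces are handled by recursion.
                    results.extend(expand_braces(prefix + alt + suffix))
                return results
        elif ch == "," and depth == 1:
            # Only commas directly inside the outermost brace split it.
            alternatives.append(pattern[piece_start:i])
            piece_start = i + 1
    return [pattern]  # unbalanced brace: leave the text alone

expanded = expand_braces("{{a,b,c}{1,2,3}{{yuck,Yuck},{urgh,URGH}}}")
print(len(expanded))
```

Each extra brace group multiplies the number of resulting patterns, and each of those is then matched separately.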

------
maweki
I wonder whether it would help to match from both sides (start and end)
simultaneously, since you know you're not looking in the middle of the string.
You also don't care about capture groups.

~~~
rsc
It would help this example. It wouldn't help the general case.

~~~
maweki
But in general the reversal of a glob pattern is trivial while the reversal of
the equivalent regular expression is not, no?

~~~
rsc
Reversing a regular expression and reversing a glob are about the same: you
just flip everything around and you're done. In fact RE2 uses this fact to
speed up regexp searches. See
[https://swtch.com/~rsc/regexp/regexp3.html#submatch](https://swtch.com/~rsc/regexp/regexp3.html#submatch)
and scan ahead a bit for "Run the DFA backward".
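
For globs, a sketch of what "flip everything around" can look like, assuming patterns made only of literals, `*` and `?` (bracket classes would need more care than a plain string reversal). The helper names are hypothetical, and `fnmatch.fnmatchcase` stands in for whatever matcher is in use:

```python
import fnmatch

def reverse_simple_glob(pattern):
    """Reverse a glob that uses only literals, * and ? (no [...] classes)."""
    return pattern[::-1]

def matches_both_ways(pattern, name):
    """Match forward, then match the reversed name against the reversed glob."""
    forward = fnmatch.fnmatchcase(name, pattern)
    backward = fnmatch.fnmatchcase(name[::-1], reverse_simple_glob(pattern))
    return forward, backward

print(matches_both_ways("a*b?c", "aXXbYc"))
print(matches_both_ways("a*b?c", "aXXbc"))
```

Both directions agree on these inputs, which is what makes it possible to run a matcher from the end of the string.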

------
mixu
For fun, I ran this against node-glob ( [https://github.com/isaacs/node-glob](https://github.com/isaacs/node-glob) ).

Looks like it exhibits the slower behavior:

    
    
      n,elapsed
      1,0.07
      2,0.07
      3,0.07
      4,0.07
      5,0.16
      6,1.43
      7,19.90
      8,240.76
    

See this gist for the script
[https://gist.github.com/mixu/e4803da16e42439480eba2b29fa4448...](https://gist.github.com/mixu/e4803da16e42439480eba2b29fa44484)

------
JdeBP
> _Graphical FTP clients typically use the MLST and MLSD commands_

Do not count WWW browsers amongst the number of those graphical FTP clients.
The common WWW browsers that speak FTP use LIST or LIST -l . With the
exception of Google Chrome when it thinks that it is talking to a VMS program,
they do not pass pattern arguments, though.

------
libre-man
I tested Common Lisp. SBCL seems to be exponential while Clozure CL is not.

However it should be noted that it is non-portable to do globbing in Common
Lisp, so I expect most users implement it using something like CL-FAD or OSICAT
and CL-PPCRE, and CL-PPCRE is efficient.

------
E6300
I've been playing around with my own glob implementation. From what I've seen,
the simplified algorithm mentioned in the article wouldn't be able to handle
question marks. In particular, I don't think a non-backtracking algorithm can
handle a pattern like "?a _?a_?a _?a_?b". I've been working to minimize the
worst-case behavior, but it's tricky.

~~~
JdeBP
Your semantics for question marks are wrong. Question marks matching zero or
one characters are the semantics for IBM/Microsoft command interpreters, for
reasons that go back all of the way to CP/M. (Strictly speaking the original
semantics amounted to, because of the 8.3 field padding, question marks
matching a character, the end of the string, or zero characters if immediately
before the dot.)

In POSIX, question marks match _exactly one_ character, always. There's no
need for backtracking.

* [http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3...](http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3_chap02.html#tag_18_13_01)

~~~
E6300
HN screwed up my comment. It's not supposed to be "question mark, a, question
mark, a, ...". I meant to write "question mark, a, star, question mark, a,
star, ...".

~~~
JdeBP
That would be backtracking caused by those stars, _not_ by the question marks.
Hacker News didn't screw up the part where you wrote "wouldn't be able to
handle question marks".

~~~
E6300
A pattern like "star a star a star a star" can be handled by the algorithm
described in the article:

> Consider the pattern "a star bx star cy star d". If we end the first star at
> the first bx, we have the rest of the name to find the cy and then the d.
> Using any later bx can only remove choices for cy and d; it cannot lead to a
> successful match that using the first bx missed. So we should implement a
> star bx star cy star d without any second-guessing, as “find a leading a,
> then find the earliest bx after that, then find the earliest cy after that,
> then find a trailing d, or else give up.”

This algorithm doesn't work if the pattern has question marks.

~~~
rsc
Yes it does.
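
A sketch of a matcher in the spirit of the approach quoted above: it remembers only the most recent star and, on a mismatch, restarts that star one character later. This is a Python adaptation written for illustration, not the article's code, and it handles only literals, `*` and `?`:

```python
def glob_match(pattern, name):
    """Match name against a glob of literals, * and ?, one restart point only."""
    px = nx = 0
    next_px = next_nx = 0  # where to resume if the current attempt fails
    while px < len(pattern) or nx < len(name):
        if px < len(pattern):
            c = pattern[px]
            if c == "*":
                # Try matching zero characters here; if that fails later,
                # resume at this star with one more character consumed.
                next_px, next_nx = px, nx + 1
                px += 1
                continue
            if nx < len(name) and (c == "?" or c == name[nx]):
                # '?' consumes exactly one character, like a literal does.
                px += 1
                nx += 1
                continue
        if 0 < next_nx <= len(name):
            px, nx = next_px, next_nx
            continue
        return False
    return True

# The star/question-mark pattern from this thread.
print(glob_match("?a*?a*?a*?a*?b", "xaxaxaxaxb"))
print(glob_match("?a*?a*?a*?a*?b", "a" * 40))
```

A question mark is treated as a literal that happens to match any single character, so it never adds a restart point of its own.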

------
mlgh
Sorry, but the implementation posted is O(|pattern| * |name|), not linear.
[http://ideone.com/2xCXyY](http://ideone.com/2xCXyY)

~~~
burntsushi
The size of the pattern is held as constant. Or more accurately, "it's linear"
is an abbreviated form of "it's linear in the size of the text searched."
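
Both statements can be checked by counting loop iterations in a single-restart matcher. The sketch below is a Python adaptation written for illustration (literals, `*` and `?` only), with a hypothetical `glob_match_steps` helper that returns the iteration count alongside the result:

```python
def glob_match_steps(pattern, name):
    """Return (matched, loop_iterations) for a glob of literals, * and ?."""
    px = nx = 0
    next_px = next_nx = 0
    steps = 0
    while px < len(pattern) or nx < len(name):
        steps += 1
        if px < len(pattern):
            c = pattern[px]
            if c == "*":
                next_px, next_nx = px, nx + 1
                px += 1
                continue
            if nx < len(name) and (c == "?" or c == name[nx]):
                px += 1
                nx += 1
                continue
        if 0 < next_nx <= len(name):
            px, nx = next_px, next_nx  # restart the last star one char later
            continue
        return False, steps
    return True, steps

# A fixed pattern that nearly matches at every position of the name.
pattern = "*" + "a" * 10 + "b"
for n in (100, 200, 400):
    matched, steps = glob_match_steps(pattern, "a" * n)
    print(n, matched, steps)
```

With the pattern held fixed, doubling the name length roughly doubles the step count. For this input the total stays below about (|pattern| + 1) * (|name| + 1), which is the product mlgh describes.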

------
jankedeen
How about the default sort? Ouch or no ouch?

------
BuuQu9hu
We independently reinvented an adaptation of this algorithm for Monte's
"simple" quasiliteral, which does simple string interpolation and matching.
The code at [https://github.com/monte-language/typhon/blob/master/mast/pr...](https://github.com/monte-language/typhon/blob/master/mast/prelude/simple.mt#L68-L121)
is somewhat similar in appearance and structure to the examples in the post.

    
    
      def name := "Hackernews"
      # greeting == "Hi Hackernews!"
      def greeting := `Hi $name!`
      # language == "Lojban"
      def `@language is awesome` := "Lojban is awesome"
    

A quirk of our presentation is that adjacent zero-or-more patterns degenerate,
with each subsequent pattern matching the empty string. This mirrors the
observation in the post that some systems can coalesce adjacent stars without
changing the semantics:

    
    
      # one == "", two == "cool"
      def `adjacent @one@two patterns` := "adjacent cool patterns"

~~~
triangleman
hey BuuQu9hu, your comment started out as [dead] for some reason. You may have
been inadvertently hellbanned.

~~~
proaralyst
...hellbanned?

~~~
darklajid
Being (silently) hidden so that your posts look okay to you but don't show up
for others (unless they have 'show dead' set to on).

It's an effective filtering mechanism, but some consider it rather cruel and
try to notify posters about it.

~~~
triangleman
False positives suck. There are people coming on here day after day, posting
valuable or at least worthy comments, and nobody can see them. They waste
their time unknowingly. It is cruel.

------
oconnore
Why write a glob engine at all when you already have a fast regex
implementation that can match both exact paths and plausible subtrees?

The bulk of the Haskell code to do this:

    
    
        parseGlob :: Char -> Char -> String -> Parser Glob
        parseGlob escC sepC forbid =
            many1' (gpart <|> sep <|> glob <|> alt) >>= return . GGroup . V.fromList
          where gpart = globPart escC (sepC : (forbid ++ "{*")) >>= return . GPart
                sep = satisfy (== ch2word sepC) >> return GSeparator
                alt = do
                  _ <- AttoC.char '{'
                  choices <- sepBy' (GEmpty `option` parseGlob escC sepC (",}" ++ forbid)) (char ',')
                  _ <- AttoC.char '}'
                  return $ GAlternate $ V.fromList choices
                glob = do
                  res <- takeWhile1 (== ch2word '*')
                  if B.length res == 1 then
                    return GSingle
                  else
                    return GDouble
    
        wrapParens s = T.concat ["(", s, ")"]
    
        globRegex :: Char -> Glob -> T.Text
        globRegex sep  GSingle       = T.concat ["([^", T.singleton sep, "]*|\\", T.singleton sep, ")"]
        globRegex _    GDouble       = ".*"
        globRegex _    GEmpty        = ""
        globRegex sep  GSeparator    = T.singleton sep
        globRegex sep (GRepeat a)    = T.concat ["(", T.concat (V.toList $ fmap (globRegex sep) a), ")*"]
        globRegex sep (GGroup a)     = T.concat $ V.toList $ fmap (globRegex sep) a
        globRegex _   (GPart p)      = T.concatMap efun base
          where base = TE.decodeUtf8 p
                escChars = S.fromList ".[]()\\{}^$*+"
                efun c = if S.member c escChars
                         then T.concat ["\\", T.singleton c]
                         else T.singleton c
        globRegex sep (GAlternate a) =
            if V.null alts
            then ""
            else T.concat [altsStr, if hasEmpty then "?" else ""]
          where hasEmpty = isJust $ V.find (== GEmpty) a
                alts = fmap (globRegex sep) $ V.filter (/= GEmpty) a
                altsStr = wrapParens $ T.intercalate "|" $ V.toList alts

~~~
f2f
> Why write a glob engine at all when you already have a fast regex

you didn't read the article. there are glob implementations that do just that.

~~~
oconnore
The bulk of the article is about how the Go implementation avoids this
behavior without just converting to regex.

~~~
hkeide
No, the Go standard library implementation is mentioned in a single paragraph.
The Go code in the article is written simply as an illustration.

------
gwu78
[https://github.com/skarnet/execline/raw/master/src/execline/...](https://github.com/skarnet/execline/raw/master/src/execline/elglob.c)

[https://github.com/skarnet/execline/raw/master/src/libexecli...](https://github.com/skarnet/execline/raw/master/src/libexecline/exlsn_elglob.c)

Simple.

[http://www.in-ulm.de/~mascheck/various/argmax/](http://www.in-ulm.de/~mascheck/various/argmax/)

    
    
       execlineb -c 'elglob a /*/*/*/* ls $a'
    

(statically-linked execlineb)

If I am not mistaken, ARG_MAX will be the limit.

Straightforward.

~~~
JdeBP
That's a wrapper around the C library's glob() function, which is what the
headlined article was looking at. What point are you trying to make?

