Hacker News new | past | comments | ask | show | jobs | submit login
Glob Matching Can Be Simple and Fast Too (swtch.com)
333 points by secure on Apr 24, 2017 | hide | past | web | favorite | 95 comments

I have not looked at the other linear-time implementations to see what they do, but I expect they all use one of these two approaches.

Python's glob() (fnmatch really) translates the glob to a regular expression then uses its re library:


Thanks for pointing this out. I wrote a Python test program but forgot to run that test in the data collection for the graphs. I will update them once the tests finish running. Ironically, Python must be passing the glob to an exponential-time regexp library, since Python is on the slow side of the fence.

Russ, while you're around, thanks for re2! RE2::Set especially! But why is RE2::Set so underdocumented and underpromoted? All regex libraries need this functionality.

You're welcome. I think basically all of RE2 is underdocumented and underpromoted. Why should RE2::Set be any different?

Seriously, though, it was a bit of an afterthought. A team at Google was |ing together a ton of regexps and came to me for something better, so I wrote RE2::Set. I'm glad it helps others too.

Your previous article on regex engines I think indicated that Python's is backtracking and with poor performance.

Indeed, I just didn't want to presume that nothing had changed in the past 10 years (but apparently not).

Python's regex library compiles the regex to a sequence of opcodes and then does recursive descent, so yes, it's exponential.

Yep, Python's "re" is the usual PCRE+X story.

Python has its own engine which is not based on PCRE AFAIK:


I noticed you linked to python-git instead of the main Python repo. The whole project has been moved as of a couple months.


Yeah, that was a mistake on my part but by the time somone pointed it out I couldn't edit my comment anymore. Sorry. :-(

Here is the same but with star and doublestar for handling slashes:

https://github.com/borgbackup/borg/blob/master/src/borg/shel... http://borgbackup.readthedocs.io/en/stable/usage.html#borg-h... ("Shell-style patterns, selector sh:")

That implementation looks pretty inefficient, creating lots of strings that are thrown away immediately. Could it not be refactored into a generator combined with a string.join?

I think Python's string type has an optimization that concatenation on a string where the LHS only has a single reference to it will reuse the existing string when possible. You still have to grow the buffer periodically, of course, but it's not creating as much garbage as it looks like.

> I think Python's string type has an optimization that concatenation on a string where the LHS only has a single reference to it will reuse the existing string when possible.

It's specifically a CPython optimisation, it may (and usually will) not hold up on alternate implementations like pypy or jyton.

Sure, but the function in question lives in CPython's own implementation of its core libraries, so it seems reasonable for it to rely on that implementation.

If PyPy or Jython want to reuse that core library code, it's on them to make whatever changes are needed to make it fast on those implementations.

Usual use would be "translate pattern once, compile it and then match 1000 file names against it".

Ahh yes, it's cached in fnmatchcase. It's pretty fast as well, <10µs for a basic match.

A cache without any eviction! Probably should not expose `fnmatch` to the network where the pattern varies.

That's an old version of Python source from a mirror. It uses functools.lru_cache with a maxsize of 256 now: https://github.com/python/cpython/blob/fcfe80ec2592fed8b3941...

Gah, thanks for the proper link. I should've double checked (and realized it since no updates in 8 years is unlikely).

The RE engine here is only used as a fast VM; so the pathological cases from the article should still apply. So apart from the memory use issues there's also that.

It should be updated with a functools.lru_cache decorator I think.

Previously: "Regular Expression Matching Can Be Simple And Fast" (2007) https://swtch.com/~rsc/regexp/regexp1.html The paper deals with "Thompson NFA" approach to regex, with low computational complexity.

Other Russ' papers on regular expression matching: https://swtch.com/~rsc/regexp/

It's funny, I had an immediate "I've seen these graphs before" reaction" with this article, and it turns out your first link is something I was working on last month for a leetcode question.

It was probably one of the more useful/understandable pages I found in that process.

Interesting how the Rust implementation of glob currently seems to be the slowest out of the linear time implementations. I guess maybe not too much optimisation effort was put into it?

There's no implementation of globbing in Rust's standard library, so we'd need more information regarding which implementation was used for the blog post. I'm guessing one of two possibilities:

1. The glob crate, which is stdlib ejecta from pre-1.0, and appears to be in maintenance mode (it hasn't seen a "real" commit since Jan 2016).

2. The regex crate, which is very actively developed (and in fact inspired by rsc's writings and Go's regex implementation).

If it's the former, then indeed it's probably as disinteresting as "nobody has bothered to ever benchmark this crate". But if it's the latter then I bet burntsushi would be very interested!

I googled for Rust glob match and ended up at the glob crate. This is my Rust program:

  extern crate glob;
  extern crate time;
  use glob::glob;
  use time::PreciseTime;
  fn main() {
      for i in 0..100 {
          let pattern = "a*".repeat(i)+"b";
          let t = PreciseTime::now();
          let mul = 100;
          for j in 0..mul {
              for entry in glob(&pattern).expect("Failed to read glob pattern") {
                  match entry {
                      Ok(path) => println!("{:?}", path.display()),
                      Err(e) => println!("{:?}", e),
          let t1 = PreciseTime::now();
          println!("{:?} {:?}", i, (t.to(t1).num_nanoseconds().expect("ok") as f64)/1e9/(mul as f64))
Honestly I'm not too worried about Rust's performance here: it's on the right side of the fence.

Thanks for the info! In the course of this conversation I've also been made aware of the globset crate, by the same author as the regex crate: https://crates.io/crates/globset . I'm going to start a conversation about seeing whether anybody might want to start deprecating glob in favor of globset, in the interest of having just one "blessed" library for this.

FWIW, the globset crate works by converting globs to regexes (so it is therefore also linear time). The downside is that you need to pay the overhead of regex compilation. In general, it should be pretty small since globs probably don't result in huge regexes, but I bet compilation is slower than what's in the `glob` crate currently. (I've never benchmarked it though, so I'm just guessing.)

Doesn't your regex crate avoid compilation of various types of "simple" regexes, and instead just translates them to simple string library calls? If so, wouldn't most globs be converted to such?

You have the general idea right in that the matching routine itself might completely avoid the regex engine. But that doesn't mean the implementation saves every possible instruction during compilation. For the most part, compilation is assumed to be amortized somehow, so as long as it's reasonable, not much work has gone into making it as fast as possible.

I assume compiling/translating globs to regexp also takes some time. I wonder if that could be meaningful in some pathological cases?

It's meaningful when you can't amortize compilation. I wouldn't call those pathological, but they seem niche.

The translation from glob to regex at the syntax level is likely a non-factor. It's the compilation of the regex that probably starts to matter.

Why just deprecate? Can't it be made to proxy behind the scenes to globset?

I'd question if it should be considered maintenance mode if it is complete.

I understand why monolithic utility and frameworks need constant updating. It seems that in the "micro library" world, though, we should see and encourage "finished" libraries.

Maintenance mode just suggests that no one is currently trying to make it any faster.

Indeed, I draw a stark line between "maintenance mode" and "abandoned". The glob crate isn't abandoned, it still accepts patches, but it doesn't appear to be anyone's priority.

EDIT: It's also worth mentioning that there's an important mental distinction between a "1.0-level" maintenance mode crate and a pre-1.0 maintenance mode crate, as the former at least implies that somebody considers the crate to be marginally usable and feature-complete. glob is a pre-1.0 crate, so even though patches are accepted, I wouldn't necessarily consider it to be well-vetted without doing more research.

This makes sense, but seems unlikely to be what most people hear when you say something is in maintenance mode. In particular, "keep the lights on" is a death sentence for most projects.

On a glance, I don't see any support for globbing in regex crate.

I guess it's the one that's part of ripgrep,


> This crate implements globs by converting them to regular expressions, and executing them with the regex crate.

I was originally sort of hoping that the OP had written some code to implement globbing based on the regex crate, because that sounds like it would be interesting. :) (And because I forgot that the glob crate existed, frankly.)

There's another way for glob() implementations to mitigate these sort of patterns that Russ doesn't discuss, but can be inferred from a careful reading of the different examples in this new glob() article & the 2007 regex article.

In the regex article he notes that e.g. perl is subject to pathological behavior when you match a?^na^n against an a^n:

    $ time perl -wE 'my $l = shift; my $str = "a" x $l; my $rx = "a?" x $l . $str; $str =~ /${rx}/' 28
    real    0m13.278s
However changing the pattern to /${rx}b/ makes it execute almost instantly. This is because the matcher will look ahead for fixed non-pattern strings found in the pattern, and deduce that whatever globbing we're trying to match now it can't possibly matter if the string doesn't have a "b" in it.

I wonder if any globbing implementations take advantage of that class of optimization, and if there's any cases where Russ's suggested solution of not backtracking produces different results than you'd get by backtracking, in particular with some of the extended non-POSIX glob syntax out there.

That's not globbing but using the regex matcher. perl's glob does the same as PHP, it simply calls the system glob. So I'm curious why the graph displays it as exponential on linux, where it should be linear.

In your example pcre2 performs much better than perl btw: it errors with match limit exceeded (-47), while perl happily burns exponential CPU. It's now even worse than before Russ' original perl article. Now it descends into heap memory, before only into the stack. So now it will keep crunching forever on the whole heap, while before the perl 5.10 rewrite triggered by Russ it died fast on stack overflow.

Empirically, Perl's glob does not call the system glob. On my Linux system Perl (5.18.2) is slow but the system glob is fast. Create a file named /tmp/glob/$(perl -e 'print "a"x100') in an otherwise empty /tmp/glob and then try:

  $ cat tglob.pl
  use Time::HiRes qw(clock_gettime);
  $| = 1;
  chdir "/tmp/glob" || die "$!";
  for($i=0; $i<9; $i++) {
      $pattern = ("a*"x$i) . "b";
      $t = clock_gettime(CLOCK_REALTIME);
      $mul = 10;
      if($i >= 5){ 
          $mul = 1;
      for($j=0; $j<$mul; $j++) {
          glob $pattern;
      $t1 = clock_gettime(CLOCK_REALTIME);
      printf("%d %.9f\n", $i, ($t1-$t)/$mul);
  $ perl -v
  This is perl 5, version 18, subversion 2 (v5.18.2) built for x86_64-linux-gnu-thread-multi
  $ perl tglob.pl
  0 0.000004911
  1 0.000016212
  2 0.000088072
  3 0.002416682
  4 0.030226517
  5 0.452545881
  6 6.872966528
You're the second person to claim that Perl calls the system glob though (someone in my blog comments did too). Maybe different versions of Perl do different things? This is Ubuntu 14.04 if that matters.

It uses the wrong system glob :) It either calls out on VMS or Windows to a perlglob executable, or uses BSD glob from a shared lib. So whenever perl makes a choice it makes the wrong one.

No you are right. Modern perls use the BSD code in File::Glob.

    > That's not globbing but using the regex matcher.
Indeed. I'm just using the regex behavior to show that perl's regex matcher uses a general optimization to defeat patterns like the ones Russ is discussing using a strategy orthogonal to handling backtracking.

But Russ doesn't talk about using this strategy to optimize glob(), so it's worth pointing out that it could just as well be used in a glob() implementation.

I.e. just try to find a fixed part later in the pattern, and if it can't be found fail the entire match.

    > I'm curious why the graph displays [perl]
    > as exponential on linux, where it should be linear.
The glob() routine in perl uses a fork of a BSD glob which ships with perl: https://github.com/Perl/perl5/blob/blead/ext/File-Glob/bsd_g...

    > pcre2 performs much better than perl btw: it
    > errors with match limit exceeded (-47)
The pcre2 library will still perform better if you adjust the match limit so that the pattern actually matches, e.g. with --match-limit=1000000000 to pcre2grep. That finishes in around 1s with PCRE, 10s with Perl.

But in general a regex library can't be said to perform better just because it errors out with a match limit error sooner. That just means it's compiled with different defaults.

Scanning for fixed non-pattern strings would help this example. It wouldn't help the general case (which is to say, it wouldn't help always), whereas a better algorithm would.

OP, what version(s) of the BSD libc did you test? What OS, which version of the OS.

macOS only? NetBSD? FreeBSD? OpenBSD?

If you tested on FreeBSD, please file a bug at https://bugs.freebsd.org/bugzilla/enter_bug.cgi?product=Base...

I'm not a project member but I'm a user of the system so it's in my interest that issues like this are resolved.

Please let me know whether or not you file a bug so that if you do I don't duplicate bug reports and if you don't I can do some benchmarking myself.

I copied the glob implementation from one of the BSDs - I believe FreeBSD - into a standalone C program and ran that program on the same Linux system as the rest of the tests. Here's the version that tests the system glob. If you run it on your FooBSD systems you can see whether it runs quickly or not. The program assumes that you've already done:

  rm -rf /tmp/glob
  mkdir /tmp/glob
  cd /tmp/glob
  touch $(perl -e 'print "a"x100')
And here's the program:

  #include <stdio.h>
  #include <glob.h>
  #include <unistd.h>
  #include <string.h>
  #include <stdlib.h>
  #include <dirent.h>
  #include <time.h>
      glob_t g;
      char pattern[1000], *p;
      struct timespec t, t2;
      double dt;
      int i, j, k;
      int mul = 1000;
      for(i = 0; i < 100; i++) {
          p = pattern;
          for (k = 0; k < i; k++) {
              *p++ = 'a';
              *p++ = '*';
          *p++ = 'b';
          *p = '\0';
          printf("# %d %s\n", i, pattern);
          clock_gettime(CLOCK_REALTIME, &t);
          for (j = 0; j < mul; j++) {
              memset(&g, 0, sizeof g);
              if(glob(pattern, 0, 0, &g) != GLOB_NOMATCH) {
                  fprintf(stderr, "error: found matches\n");
          clock_gettime(CLOCK_REALTIME, &t2);
          t2.tv_sec -= t.tv_sec;
          t2.tv_nsec -= t.tv_nsec;
          dt = t2.tv_sec + (double)t2.tv_nsec/1e9;
          printf("%d %.9f\n", i, dt/mul);
          if(dt/mul >= 0.0001)
              mul = 1;
          if(i >= 8 && dt/mul >= 10)
I won't be filing any specific bugs myself. I mailed oss-security@ this morning, which should reach relevant BSD maintainers, but more bug filing can't hurt.

It's still exponential on FreeBSD CURRENT (per rsc's test program in the sibling thread):

    # 0 b
    0 0.000000661
    # 1 a*b
    1 0.000005204
    # 2 a*a*b
    2 0.000036157
    # 3 a*a*a*b
    3 0.001124974
    # 4 a*a*a*a*b
    4 0.046123932
    # 5 a*a*a*a*a*b
    5 0.577535034
    # 6 a*a*a*a*a*a*b
    6 8.666314375
    # 7 a*a*a*a*a*a*a*b
    7 119.796626615
(I didn't wait for 8.)

Please file a bug.

Slightly off-topic, but anyone know what he's using to generate those inline SVG graphs? I've been looking for some easy to use graphing library like that to present similar performance numbers on a webpage.

I looked around but didn't find anything I liked. The regexp article [1] uses jgraph [2] to plot the data as eps, then ghostscript to turn eps to png. That no longer works in a world of variable-dpi screens, so for this article I generated SVGs and inlined them into the HTML. The SVGs are generated by a custom program I wrote for the job [3], to mimic the old jgraph graphs. It's not pretty code, but it gave me a lot of control over the presentation and produced decent results. You're welcome to adapt it if you like.

[1] https://swtch.com/~rsc/regexp/regexp1.html

[2] https://web.eecs.utk.edu/~plank/plank/jgraph/jgraph.html

[3] https://research.swtch.com/globgraph.go

My guess is that it's Gnuplot. At least gnuplot has an SVG driver: http://gnuplot.sourceforge.net/demo_svg_4.6/

No giveaways in the source code - will probably have to wait for the author to comment!

Not sure if OP is author, but if you are, just to inform you, there is a small typo in this paragraph:

"Unfortunately, none of tehse protections address the cost of matching a single path element of a single file name. In 2005, CVE-2005-0256 was issued for a DoS vulnerability in WU-FTPD 2.6.2, because it ran for a very long time finding even a single match during:"

Very informative article. Thanks for it!

Thanks, fixed the typo (author, not OP).

The bsd derived glob has other functionality that I assume isn't simple or fast:

  perl -MFile::Glob=bsd_glob -e 'print bsd_glob("{{a,b,c}{1,2,3}{{yuck,Yuck},{urgh,URGH}}}\n")'
Produces 36 lines representing all the iterations. Nest a bit deeper and it gets unwieldy.

Those are actually expanded (in the recursive fashion you would expect) before any "star" matching is done.

There have been various CVEs already around brace expansion, but yes, it gets bad.

I wonder whether it would help to match from both sides (start and end) simultaneously, since you know you're not looking in the middle of the string. You also don't care about capture groups.

It would help this example. It wouldn't help the general case.

But in general the reversal of a glob pattern is trivial while the reversal of the equivalent regular expression is not, no?

Reversing a regular expression and reversing a glob are about the same: you just flip everything around and you're done. In fact RE2 uses this fact to speed up regexp searches. See https://swtch.com/~rsc/regexp/regexp3.html#submatch and scan ahead a bit for "Run the DFA backward".

For fun, I ran this against node-glob ( https://github.com/isaacs/node-glob ).

Looks like it exhibits the slower behavior:

See this gist for the script https://gist.github.com/mixu/e4803da16e42439480eba2b29fa4448...

> Graphical FTP clients typically use the MLST and MLSD commands

Do not count WWW browsers amongst the number of those graphical FTP clients. The common WWW browsers that speak FTP use LIST or LIST -l . With the exception of Google Chrome when it thinks that it is talking to a VMS program, they do not pass pattern arguments, though.

I tested Common Lisp. SBCL seems to be exponential while Clozure CL is not.

However it should be noted that it is non portable to do globbing in Common Lisp, so I expect most users implement it using something CL-FAD or OSICAT and CL-PPCRE, and CL-PPCRE is efficient.

I've been playing around with my own glob implementation. From what I've seen, the simplified algorithm mentioned in the article wouldn't be able to handle question marks. In particular, I don't think a non-backtracking algorithm can handle a pattern like "?a?a?a?a?b". I've been working to minimize the worst-case behavior, but it's tricky.

Your semantics for question marks are wrong. Question marks matching zero or one characters are the semantics for IBM/Microsoft command interpreters, for reasons that go back all of the way to CP/M. (Strictly speaking the original semantics amounted to, because of the 8.3 field padding, question marks matching a character, the end of the string, or zero characters if immediately before the dot.)

In POSIX, question marks match exactly one character, always. There's no need for backtracking.

* http://pubs.opengroup.org/onlinepubs/9699919799/utilities/V3...

HN screwed up my comment. It's not supposed to be "question mark, a, question mark, a, ...". I meant to write "question mark, a, star, question mark, a, star, ...".

That would be backtracking caused by those stars, not by the question marks. Hacker News didn't screw up the part where you wrote "wouldn't be able to handle question marks".

A pattern like "star a star a star a star" can be handled by the algorithm described in the article:

> Consider the pattern "a star bx star cy star d". If we end the first star at the first bx, we have the rest of the name to find the cy and then the d. Using any later bx can only remove choices for cy and d; it cannot lead to a successful match that using the first bx missed. So we should implement a star bx star cy star d without any second-guessing, as “find a leading a, then find the earliest bx after that, then find the earliest cy after that, then find a trailing d, or else give up.”

This algorithm doesn't work if the pattern has question marks.

Yes it does.

I made a type in my previous response. I wrote "can be handled" instead of "can't be handled". The algorithm described can't handle that case.

In addition to what JdeBP said about the semantics of question marks, there's always the option of using a proper automata translation (or translation to a regexp with a suitable underlying engine). Neither of those require backtracking.

Sorry, but the implementation posted is O(|pattern| * |name|), not linear. http://ideone.com/2xCXyY

The size of the pattern is held as constant. Or more accurately, "it's linear" is an abbreviated form of "it's linear in the size of the text searched."

How about the default sort? Ouch or no ouch?

We independently reinvented an adaptation of this algorithm for Monte's "simple" quasiliteral, which does simple string interpolation and matching. The code at https://github.com/monte-language/typhon/blob/master/mast/pr... is somewhat similar in appearance and structure to the examples in the post.

  def name := "Hackernews"
  # greeting == "Hi Hackernews!"
  def greeting := `Hi $name!`
  # language == "Lojban"
  def `@language is awesome` := "Lojban is awesome"
A quirk of our presentation is that adjacent zero-or-more patterns degenerate, with each subsequent pattern matching the empty string. This mirrors the observation in the post that some systems can coalesce adjacent stars without changing the semantics:

  # one == "", two == "cool"
  def `adjacent @one@two patterns` := "adjacent cool patterns"

hey BuuQu9hu, your comment started out as [dead] for some reason. You may have been inadvertently hellbanned.


Being (silently) hidden so that your posts look okay to you but don't show up for others (unless they have 'show dead' set to on).

It's an effective filtering mechanism, but some consider it rather cruel and try to notify posters about it.

False positives suck. There are people coming on here day after day, posting valuable or at least worthy comments, and nobody can see them. They waste their time unknowingly. It is cruel.

Why write a glob engine at all when you already have a fast regex implementation that can match both exact paths and plausible subtrees?

The bulk of the haskell code to do this:

    parseGlob :: Char -> Char -> String -> Parser Glob
    parseGlob escC sepC forbid =
        many1' (gpart <|> sep <|> glob <|> alt) >>= return . GGroup . V.fromList
      where gpart = globPart escC (sepC : (forbid ++ "{*")) >>= return . GPart
            sep = satisfy (== ch2word sepC) >> return GSeparator
            alt = do
              _ <- AttoC.char '{'
              choices <- sepBy' (GEmpty `option` parseGlob escC sepC (",}" ++ forbid)) (char ',')
              _ <- AttoC.char '}'
              return $ GAlternate $ V.fromList choices
            glob = do
              res <- takeWhile1 (== ch2word '*')
              if B.length res == 1 then
                return GSingle
                return GDouble

    wrapParens s = T.concat ["(", s, ")"]

    globRegex :: Char -> Glob -> T.Text
    globRegex sep  GSingle       = T.concat ["([^", T.singleton sep, "]*|\\", T.singleton sep, ")"]
    globRegex _    GDouble       = ".*"
    globRegex _    GEmpty        = ""
    globRegex sep  GSeparator    = T.singleton sep
    globRegex sep (GRepeat a)    = T.concat ["(", T.concat (V.toList $ fmap (globRegex sep) a), ")*"]
    globRegex sep (GGroup a)     = T.concat $ V.toList $ fmap (globRegex sep) a
    globRegex _   (GPart p)      = T.concatMap efun base
      where base = TE.decodeUtf8 p
            escChars = S.fromList ".[]()\\{}^$*+"
            efun c = if S.member c escChars
                     then T.concat ["\\", T.singleton c]
                     else T.singleton c
    globRegex sep (GAlternate a) =
        if V.null alts
        then ""
        else T.concat [altsStr, if hasEmpty then "?" else ""]
      where hasEmpty = isJust $ V.find (== GEmpty) a
            alts = fmap (globRegex sep) $ V.filter (/= GEmpty) a
            altsStr = wrapParens $ T.intercalate "|" $ V.toList alts

    Why write a glob engine at all when you already have a fast regex
you didn't read the article. there are glob implementations that do just that.

From the HN guidelines:

Please don't insinuate that someone hasn't read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."

The bulk of the article is about how the Go implementation avoids this behavior without just converting to regex.

No, the Go standard library implementation is mentioned in a single paragraph. The Go code in the article is written simply as an illustration.

I took it more as the bulk of the article explaining the differences and problems in algorithms. Just knowing that you can use a non-backtracking regex engine is fine, of course. Knowing how folks got it wrong is pedagogical in its own right.

As an easy additional point to the lesson, it might not be clear to everyone why you have to take care on which regex engine you move to.

Did you just write this now? Are you saying that the above parser is linear time worst case?

I wrote it a few months ago. The resulting regex is linear time worst case in thompson or dfa regex engines.





   execlineb -c 'elglob a /*/*/*/* ls $a'
(statically-linked execlineb)

If I am not mistken, ARG_MAX will be the limit.


That's a wrapper around the C library's glob() function, which is what the headlined article was looking at. What point are you trying to make?

Registration is open for Startup School 2019. Classes start July 22nd.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact