Thanks for pointing this out. I wrote a Python test program but forgot to run that test in the data collection for the graphs. I will update them once the tests finish running. Ironically, Python must be passing the glob to an exponential-time regexp library, since Python is on the slow side of the fence.
Russ, while you're around, thanks for re2! RE2::Set especially! But why is RE2::Set so underdocumented and underpromoted? All regex libraries need this functionality.
You're welcome. I think basically all of RE2 is underdocumented and underpromoted. Why should RE2::Set be any different?
Seriously, though, it was a bit of an afterthought. A team at Google was |ing together a ton of regexps and came to me for something better, so I wrote RE2::Set. I'm glad it helps others too.
That implementation looks pretty inefficient, creating lots of strings that are thrown away immediately. Could it not be refactored into a generator combined with a string.join?
I think Python's string type has an optimization that concatenation on a string where the LHS only has a single reference to it will reuse the existing string when possible. You still have to grow the buffer periodically, of course, but it's not creating as much garbage as it looks like.
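For illustration (my own sketch, not from CPython's sources): the fast path applies when the string being extended holds the only reference, so keeping a second reference around defeats it:

```python
# Sketch of the CPython-specific fast path for str concatenation.
# When the string being extended has only one reference, CPython can
# often resize its buffer in place instead of copying the whole string.

def build_fast(n):
    s = ""
    for _ in range(n):
        s += "x"  # eligible for the in-place fast path in CPython
    return s

def build_slow(n):
    s = ""
    for _ in range(n):
        alias = s    # a second reference defeats the optimization,
        s = s + "x"  # forcing a full copy on every iteration
    return s

# Both produce the same result; only their allocation behavior differs.
print(build_fast(5) == build_slow(5) == "xxxxx")
```

The results are identical on any Python implementation; only CPython's allocation behavior differs between the two.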
> I think Python's string type has an optimization that concatenation on a string where the LHS only has a single reference to it will reuse the existing string when possible.
It's specifically a CPython optimisation; it may (and usually will) not hold on alternative implementations like PyPy or Jython.
Sure, but the function in question lives in CPython's own implementation of its core libraries, so it seems reasonable for it to rely on that implementation.
If PyPy or Jython want to reuse that core library code, it's on them to make whatever changes are needed to make it fast on those implementations.
The RE engine here is only used as a fast VM, so the pathological cases from the article should still apply. Apart from the memory-use issues, there's also that.
Previously: "Regular Expression Matching Can Be Simple And Fast" (2007) https://swtch.com/~rsc/regexp/regexp1.html
The article covers the "Thompson NFA" approach to regex matching, which has low computational complexity.
It's funny, I had an immediate "I've seen these graphs before" reaction to this article, and it turns out your first link is something I was working on last month for a leetcode question.
It was probably one of the more useful/understandable pages I found in that process.
Interesting how the Rust implementation of glob currently seems to be the slowest of the linear-time implementations. I guess maybe not much optimisation effort was put into it?
There's no implementation of globbing in Rust's standard library, so we'd need more information regarding which implementation was used for the blog post. I'm guessing one of two possibilities:
1. The glob crate, which is stdlib ejecta from pre-1.0, and appears to be in maintenance mode (it hasn't seen a "real" commit since Jan 2016).
2. The regex crate, which is very actively developed (and in fact inspired by rsc's writings and Go's regex implementation).
If it's the former, then indeed it's probably as uninteresting as "nobody has ever bothered to benchmark this crate". But if it's the latter, then I bet burntsushi would be very interested!
I googled for Rust glob match and ended up at the glob crate. This is my Rust program:
    extern crate glob;
    extern crate time;

    use glob::glob;
    use time::PreciseTime;

    fn main() {
        std::env::set_current_dir("/tmp/glob").expect("cd");
        for i in 0..100 {
            let pattern = "a*".repeat(i) + "b";
            let t = PreciseTime::now();
            let mul = 100;
            for j in 0..mul {
                for entry in glob(&pattern).expect("Failed to read glob pattern") {
                    match entry {
                        Ok(path) => println!("{:?}", path.display()),
                        Err(e) => println!("{:?}", e),
                    }
                }
            }
            let t1 = PreciseTime::now();
            println!("{:?} {:?}", i, (t.to(t1).num_nanoseconds().expect("ok") as f64) / 1e9 / (mul as f64))
        }
    }
Honestly I'm not too worried about Rust's performance here: it's on the right side of the fence.
Thanks for the info! In the course of this conversation I've also been made aware of the globset crate, by the same author as the regex crate: https://crates.io/crates/globset . I'm going to start a conversation about seeing whether anybody might want to start deprecating glob in favor of globset, in the interest of having just one "blessed" library for this.
FWIW, the globset crate works by converting globs to regexes (so it is also linear time). The downside is that you need to pay the overhead of regex compilation. In general, it should be pretty small since globs probably don't result in huge regexes, but I bet compilation is slower than what's in the `glob` crate currently. (I've never benchmarked it though, so I'm just guessing.)
Doesn't your regex crate avoid compilation of various types of "simple" regexes, and instead just translates them to simple string library calls? If so, wouldn't most globs be converted to such?
You have the general idea right in that the matching routine itself might completely avoid the regex engine. But that doesn't mean the implementation saves every possible instruction during compilation. For the most part, compilation is assumed to be amortized somehow, so as long as it's reasonable, not much work has gone into making it as fast as possible.
I'd question whether it should be considered maintenance mode if it's complete.
I understand why monolithic utilities and frameworks need constant updating. It seems that in the "micro library" world, though, we should see and encourage "finished" libraries.
Indeed, I draw a stark line between "maintenance mode" and "abandoned". The glob crate isn't abandoned, it still accepts patches, but it doesn't appear to be anyone's priority.
EDIT: It's also worth mentioning that there's an important mental distinction between a "1.0-level" maintenance mode crate and a pre-1.0 maintenance mode crate, as the former at least implies that somebody considers the crate to be marginally usable and feature-complete. glob is a pre-1.0 crate, so even though patches are accepted, I wouldn't necessarily consider it to be well-vetted without doing more research.
This makes sense, but seems unlikely to be what most people hear when you say something is in maintenance mode. In particular, "keep the lights on" is a death sentence for most projects.
I was originally sort of hoping that the OP had written some code to implement globbing based on the regex crate, because that sounds like it would be interesting. :) (And because I forgot that the glob crate existed, frankly.)
There's another way for glob() implementations to mitigate this sort of pattern that Russ doesn't discuss, but it can be inferred from a careful reading of the different examples in this new glob() article and the 2007 regex article.
In the regex article he notes that e.g. Perl is subject to pathological behavior when you match a?^na^n against a^n:
    $ time perl -wE 'my $l = shift; my $str = "a" x $l; my $rx = "a?" x $l . $str; $str =~ /${rx}/' 28
    real    0m13.278s
However, changing the pattern to /${rx}b/ makes it execute almost instantly. This is because the matcher looks ahead for fixed non-pattern strings in the pattern and deduces that, whatever we're trying to match, it can't possibly succeed if the string doesn't contain a "b".
I wonder if any globbing implementations take advantage of that class of optimization, and if there are any cases where Russ's suggested solution of not backtracking produces different results than you'd get by backtracking, in particular with some of the extended non-POSIX glob syntax out there.
That's not globbing but using the regex matcher. Perl's glob does the same as PHP's: it simply calls the system glob.
So I'm curious why the graph displays it as exponential on Linux, where it should be linear.
In your example pcre2 performs much better than Perl, btw: it errors out with "match limit exceeded" (-47), while Perl happily burns exponential CPU. It's now even worse than before Russ's original Perl article: it now recurses into heap memory, whereas before it only recursed into the stack. So now it will keep crunching until it has consumed the whole heap, while before the perl 5.10 rewrite triggered by Russ it died quickly on a stack overflow.
Empirically, Perl's glob does not call the system glob. On my Linux system Perl (5.18.2) is slow but the system glob is fast. Create a file named /tmp/glob/$(perl -e 'print "a"x100') in an otherwise empty /tmp/glob and then try:
You're the second person to claim that Perl calls the system glob though (someone in my blog comments did too). Maybe different versions of Perl do different things? This is Ubuntu 14.04 if that matters.
It uses the wrong system glob :)
It either calls out on VMS or Windows to a perlglob executable, or uses BSD glob from a shared lib.
So whenever perl makes a choice it makes the wrong one.
> That's not globbing but using the regex matcher.
Indeed. I'm just using the regex behavior to show that Perl's regex matcher uses a general optimization to defeat patterns like the ones Russ is discussing, via a strategy orthogonal to how backtracking is handled.
But Russ doesn't talk about using this strategy to optimize glob(), so it's worth pointing out that it could just as well be used in a glob() implementation.
I.e. just try to find a fixed part later in the pattern, and if it can't be found fail the entire match.
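A sketch of that idea (my own, not taken from any real glob implementation): extract the fixed literal chunks of the pattern and require each to appear in the text, in order, before running the real matcher:

```python
import re

def literal_chunks(pattern):
    # The fixed pieces of a glob: everything that isn't '*' or '?'.
    # ('?' consumes exactly one character, so it also breaks a literal run.)
    return [c for c in re.split(r"[*?]+", pattern) if c]

def could_match(pattern, text):
    # Cheap necessary condition: every literal chunk must occur in the
    # text, in order. If this fails, the full matcher can't succeed,
    # so we can reject without doing any pattern matching at all.
    pos = 0
    for chunk in literal_chunks(pattern):
        i = text.find(chunk, pos)
        if i == -1:
            return False
        pos = i + len(chunk)
    return True
```

Note this is necessary but not sufficient (it ignores anchoring and the exact character counts that `?` imposes), so a real implementation would still run the ordinary matcher when the prefilter passes.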
> I'm curious why the graph displays [perl]
> as exponential on linux, where it should be linear.
> pcre2 performs much better than perl btw: it
> errors with match limit exceeded (-47)
The pcre2 library will still perform better if you adjust the match limit so that the pattern actually matches, e.g. with --match-limit=1000000000 to pcre2grep. That finishes in around 1s with PCRE, 10s with Perl.
But in general a regex library can't be said to perform better just because it errors out with a match limit error sooner. That just means it's compiled with different defaults.
Scanning for fixed non-pattern strings would help this example. It wouldn't help the general case (which is to say, it wouldn't help always), whereas a better algorithm would.
I copied the glob implementation from one of the BSDs - I believe FreeBSD - into a standalone C program and ran that program on the same Linux system as the rest of the tests. Here's the version that tests the system glob. If you run it on your FooBSD systems you can see whether it runs quickly or not. The program assumes that you've already done:
    #include <stdio.h>
    #include <glob.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdlib.h>
    #include <dirent.h>
    #include <time.h>

    int
    main(void)
    {
        glob_t g;
        char pattern[1000], *p;
        struct timespec t, t2;
        double dt;
        int i, j, k;

        chdir("/tmp/glob");
        setlinebuf(stdout);
        int mul = 1000;
        for(i = 0; i < 100; i++) {
            p = pattern;
            for(k = 0; k < i; k++) {
                *p++ = 'a';
                *p++ = '*';
            }
            *p++ = 'b';
            *p = '\0';
            printf("# %d %s\n", i, pattern);
            clock_gettime(CLOCK_REALTIME, &t);
            for(j = 0; j < mul; j++) {
                memset(&g, 0, sizeof g);
                if(glob(pattern, 0, 0, &g) != GLOB_NOMATCH) {
                    fprintf(stderr, "error: found matches\n");
                    exit(2);
                }
                globfree(&g);
            }
            clock_gettime(CLOCK_REALTIME, &t2);
            t2.tv_sec -= t.tv_sec;
            t2.tv_nsec -= t.tv_nsec;
            dt = t2.tv_sec + (double)t2.tv_nsec/1e9;
            printf("%d %.9f\n", i, dt/mul);
            fflush(stdout);
            if(dt/mul >= 0.0001)
                mul = 1;
            if(i >= 8 && dt/mul >= 10)
                break;
        }
    }
I won't be filing any specific bugs myself. I mailed oss-security@ this morning, which should reach relevant BSD maintainers, but more bug filing can't hurt.
Slightly off-topic, but anyone know what he's using to generate those inline SVG graphs? I've been looking for some easy to use graphing library like that to present similar performance numbers on a webpage.
I looked around but didn't find anything I liked. The regexp article [1] uses jgraph [2] to plot the data as eps, then ghostscript to turn eps to png. That no longer works in a world of variable-dpi screens, so for this article I generated SVGs and inlined them into the HTML. The SVGs are generated by a custom program I wrote for the job [3], to mimic the old jgraph graphs. It's not pretty code, but it gave me a lot of control over the presentation and produced decent results. You're welcome to adapt it if you like.
Not sure if OP is author, but if you are, just to inform you, there is a small typo in this paragraph:
"Unfortunately, none of tehse protections address the cost of matching a single path element of a single file name. In 2005, CVE-2005-0256 was issued for a DoS vulnerability in WU-FTPD 2.6.2, because it ran for a very long time finding even a single match during:"
I wonder whether it would help to match from both sides (start and end) simultaneously, since you know you're not looking in the middle of the string. You also don't care about capture groups.
Reversing a regular expression and reversing a glob are about the same: you just flip everything around and you're done. In fact RE2 uses this fact to speed up regexp searches. See https://swtch.com/~rsc/regexp/regexp3.html#submatch and scan ahead a bit for "Run the DFA backward".
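For simple globs (just literals, `*`, and `?`) the flip really is plain string reversal. A quick check using Python's `fnmatch` (my own illustration, not RE2's actual mechanism):

```python
import fnmatch

def match_forward(pattern, text):
    return fnmatch.fnmatchcase(text, pattern)

def match_reversed(pattern, text):
    # Reverse both the glob and the text: for patterns built only from
    # literals, '*', and '?' the answer is unchanged. (Character classes
    # like [a-z] would need their bracket contents handled specially
    # when flipping, so this sketch ignores them.)
    return fnmatch.fnmatchcase(text[::-1], pattern[::-1])

for pat, txt in [("a*bx*cy*d", "a.bx.cy.d"), ("?a?b", "xayb"), ("a*b", "ac")]:
    assert match_forward(pat, txt) == match_reversed(pat, txt)
```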
> Graphical FTP clients typically use the MLST and MLSD commands
Do not count WWW browsers amongst the number of those graphical FTP clients. The common WWW browsers that speak FTP use LIST or LIST -l . With the exception of Google Chrome when it thinks that it is talking to a VMS program, they do not pass pattern arguments, though.
I tested Common Lisp. SBCL seems to be exponential while Clozure CL is not.
However, it should be noted that globbing is non-portable in Common Lisp, so I expect most users implement it using something like CL-FAD or OSICAT together with CL-PPCRE, and CL-PPCRE is efficient.
I've been playing around with my own glob implementation. From what I've seen, the simplified algorithm mentioned in the article wouldn't be able to handle question marks. In particular, I don't think a non-backtracking algorithm can handle a pattern like "?a?a?a?a?b". I've been working to minimize the worst-case behavior, but it's tricky.
Your semantics for question marks are wrong. Question marks matching zero or one characters are the semantics of IBM/Microsoft command interpreters, for reasons that go back all the way to CP/M. (Strictly speaking, the original semantics amounted to, because of the 8.3 field padding, question marks matching a character, the end of the string, or zero characters if immediately before the dot.)
In POSIX, question marks match exactly one character, always. There's no need for backtracking.
HN screwed up my comment. It's not supposed to be "question mark, a, question mark, a, ...". I meant to write "question mark, a, star, question mark, a, star, ...".
That would be backtracking caused by those stars, not by the question marks. Hacker News didn't screw up the part where you wrote "wouldn't be able to handle question marks".
A pattern like "star a star a star a star" can be handled by the algorithm described in the article:
> Consider the pattern "a star bx star cy star d". If we end the first star at the first bx, we have the rest of the name to find the cy and then the d. Using any later bx can only remove choices for cy and d; it cannot lead to a successful match that using the first bx missed. So we should implement a star bx star cy star d without any second-guessing, as “find a leading a, then find the earliest bx after that, then find the earliest cy after that, then find a trailing d, or else give up.”
This algorithm doesn't work if the pattern has question marks.
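Here's a sketch (my own) of the quoted first-occurrence strategy; it handles patterns built only from literals and stars, with no backtracking:

```python
def match_stars(pattern, text):
    # Pattern may contain only literal characters and '*'.
    # Strategy from the quoted paragraph: anchor the leading literal,
    # anchor the trailing literal, then take the earliest occurrence
    # of each middle chunk, left to right.
    parts = pattern.split("*")
    if len(parts) == 1:              # no stars at all: exact match
        return pattern == text
    head, *middle, tail = parts
    if not text.startswith(head):
        return False
    if not text[len(head):].endswith(tail):
        return False
    pos, end = len(head), len(text) - len(tail)
    for chunk in middle:
        i = text.find(chunk, pos, end)   # earliest occurrence is safe
        if i == -1:
            return False
        pos = i + len(chunk)
    return True
```

A `?` breaks the "earliest occurrence is always safe" reasoning because it pins an exact character count around a chunk, so the earliest occurrence of a chunk may no longer lead to a match.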
In addition to what JdeBP said about the semantics of question marks, there's always the option of using a proper automata translation (or translation to a regexp with a suitable underlying engine). Neither of those require backtracking.
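A sketch of the automata approach (my own, in the spirit of the article): simulate the set of reachable pattern positions, one text character at a time. Both `?` and `*` fall out naturally, and nothing ever backtracks:

```python
def glob_match(pattern, text):
    # Track every pattern position reachable so far (a set of states).
    # Worst case O(len(pattern) * len(text)); never exponential.

    def eps(states):
        # A '*' can match the empty string: from a '*' position we can
        # also be at the position just past it.
        states = set(states)
        stack = list(states)
        while stack:
            s = stack.pop()
            if s < len(pattern) and pattern[s] == "*" and s + 1 not in states:
                states.add(s + 1)
                stack.append(s + 1)
        return states

    states = eps({0})
    for ch in text:
        nxt = set()
        for s in states:
            if s == len(pattern):
                continue
            if pattern[s] == "*":
                nxt.add(s)            # '*' consumes ch and stays put
            elif pattern[s] == "?" or pattern[s] == ch:
                nxt.add(s + 1)        # '?' or a matching literal
        states = eps(nxt)
    return len(pattern) in states
```

Patterns like "?a*?a*b" from the discussion above are handled in a single left-to-right pass.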
The size of the pattern is held constant. Or more accurately, "it's linear" is shorthand for "it's linear in the size of the text searched."
We independently reinvented an adaptation of this algorithm for Monte's "simple" quasiliteral, which does simple string interpolation and matching. The code at https://github.com/monte-language/typhon/blob/master/mast/pr... is somewhat similar in appearance and structure to the examples in the post.
    def name := "Hackernews"
    # greeting == "Hi Hackernews!"
    def greeting := `Hi $name!`

    # language == "Lojban"
    def `@language is awesome` := "Lojban is awesome"
A quirk of our presentation is that adjacent zero-or-more patterns degenerate, with each subsequent pattern matching the empty string. This mirrors the observation in the post that some systems can coalesce adjacent stars without changing the semantics:
    # one == "", two == "cool"
    def `adjacent @one@two patterns` := "adjacent cool patterns"
False positives suck. There are people coming on here day after day, posting valuable or at least worthy comments, and nobody can see them. They waste their time unknowingly. It is cruel.
Please don't insinuate that someone hasn't read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."
I took it more as the bulk of the article explaining the differences and problems in algorithms. Just knowing that you can use a non-backtracking regex engine is fine, of course. Knowing how folks got it wrong is pedagogical in its own right.
As an easy additional point to the lesson, it might not be clear to everyone why you have to be careful about which regex engine you move to.
Python's glob() (fnmatch really) translates the glob to a regular expression then uses its re library:
https://github.com/python-git/python/blob/715a6e5035bb21ac49...
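The translation is easy to see from the interpreter (the exact spelling of the regex source varies across Python versions, so treat it as illustrative):

```python
import fnmatch
import re

# fnmatch.translate turns a shell glob into a regex source string:
# '*' becomes '.*', '?' becomes '.', and the whole thing is anchored.
src = fnmatch.translate("a*b?c")
print(src)  # exact form varies by Python version

r = re.compile(src)
print(bool(r.match("aXXbYc")))  # True
print(bool(r.match("ab")))      # False
```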