Thanks for pointing this out. I wrote a Python test program but forgot to run that test in the data collection for the graphs. I will update them once the tests finish running. Ironically, Python must be passing the glob to an exponential-time regexp library, since Python is on the slow side of the fence.
Russ, while you're around, thanks for re2! RE2::Set especially! But why is RE2::Set so underdocumented and underpromoted? All regex libraries need this functionality.
You're welcome. I think basically all of RE2 is underdocumented and underpromoted. Why should RE2::Set be any different?
Seriously, though, it was a bit of an afterthought. A team at Google was |ing together a ton of regexps and came to me for something better, so I wrote RE2::Set. I'm glad it helps others too.
That implementation looks pretty inefficient, creating lots of strings that are thrown away immediately. Could it not be refactored into a generator combined with a string.join?
I think Python's string type has an optimization that concatenation on a string where the LHS only has a single reference to it will reuse the existing string when possible. You still have to grow the buffer periodically, of course, but it's not creating as much garbage as it looks like.
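For illustration (my own sketch, not from CPython's sources): the fast path applies when the string being extended holds the only reference, so keeping a second reference around defeats it:

```python
# Sketch of the CPython-specific fast path for str concatenation.
# When the string being extended has only one reference, CPython can
# often resize its buffer in place instead of copying the whole string.

def build_fast(n):
    s = ""
    for _ in range(n):
        s += "x"  # eligible for the in-place fast path in CPython
    return s

def build_slow(n):
    s = ""
    for _ in range(n):
        alias = s    # a second reference defeats the optimization,
        s = s + "x"  # forcing a full copy on every iteration
    return s

# Both produce the same result; only their allocation behavior differs.
print(build_fast(5) == build_slow(5) == "xxxxx")
```

The results are identical on any Python implementation; only CPython's allocation behavior differs between the two.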
> I think Python's string type has an optimization that concatenation on a string where the LHS only has a single reference to it will reuse the existing string when possible.
It's specifically a CPython optimisation; it may (and usually will) not hold on alternative implementations like PyPy or Jython.
Sure, but the function in question lives in CPython's own implementation of its core libraries, so it seems reasonable for it to rely on that implementation.
If PyPy or Jython want to reuse that core library code, it's on them to make whatever changes are needed to make it fast on those implementations.
The RE engine here is only used as a fast VM, so the pathological cases from the article should still apply. Apart from the memory-use issues, there's also that.
Previously: "Regular Expression Matching Can Be Simple And Fast" (2007) https://swtch.com/~rsc/regexp/regexp1.html
The article covers the "Thompson NFA" approach to regex matching, which has low computational complexity.
It's funny, I had an immediate "I've seen these graphs before" reaction to this article, and it turns out your first link is something I was working on last month for a leetcode question.
It was probably one of the more useful/understandable pages I found in that process.
Interesting how the Rust implementation of glob currently seems to be the slowest of the linear-time implementations. I guess maybe not much optimisation effort was put into it?
There's no implementation of globbing in Rust's standard library, so we'd need more information regarding which implementation was used for the blog post. I'm guessing one of two possibilities:
1. The glob crate, which is stdlib ejecta from pre-1.0, and appears to be in maintenance mode (it hasn't seen a "real" commit since Jan 2016).
2. The regex crate, which is very actively developed (and in fact inspired by rsc's writings and Go's regex implementation).
If it's the former, then indeed it's probably as uninteresting as "nobody has ever bothered to benchmark this crate". But if it's the latter, then I bet burntsushi would be very interested!
I googled for Rust glob match and ended up at the glob crate. This is my Rust program:
    extern crate glob;
    extern crate time;

    use glob::glob;
    use time::PreciseTime;

    fn main() {
        std::env::set_current_dir("/tmp/glob").expect("cd");
        for i in 0..100 {
            let pattern = "a*".repeat(i) + "b";
            let t = PreciseTime::now();
            let mul = 100;
            for j in 0..mul {
                for entry in glob(&pattern).expect("Failed to read glob pattern") {
                    match entry {
                        Ok(path) => println!("{:?}", path.display()),
                        Err(e) => println!("{:?}", e),
                    }
                }
            }
            let t1 = PreciseTime::now();
            println!("{:?} {:?}", i, (t.to(t1).num_nanoseconds().expect("ok") as f64) / 1e9 / (mul as f64))
        }
    }
Honestly I'm not too worried about Rust's performance here: it's on the right side of the fence.
Thanks for the info! In the course of this conversation I've also been made aware of the globset crate, by the same author as the regex crate: https://crates.io/crates/globset . I'm going to start a conversation about seeing whether anybody might want to start deprecating glob in favor of globset, in the interest of having just one "blessed" library for this.
FWIW, the globset crate works by converting globs to regexes (so it is also linear time). The downside is that you need to pay the overhead of regex compilation. In general, it should be pretty small since globs probably don't result in huge regexes, but I bet compilation is slower than what's in the `glob` crate currently. (I've never benchmarked it though, so I'm just guessing.)
Doesn't your regex crate avoid compilation of various types of "simple" regexes, and instead just translates them to simple string library calls? If so, wouldn't most globs be converted to such?
You have the general idea right in that the matching routine itself might completely avoid the regex engine. But that doesn't mean the implementation saves every possible instruction during compilation. For the most part, compilation is assumed to be amortized somehow, so as long as it's reasonable, not much work has gone into making it as fast as possible.
I'd question whether it should be considered maintenance mode if it's complete.
I understand why monolithic utilities and frameworks need constant updating. It seems that in the "micro library" world, though, we should see and encourage "finished" libraries.
Indeed, I draw a stark line between "maintenance mode" and "abandoned". The glob crate isn't abandoned, it still accepts patches, but it doesn't appear to be anyone's priority.
EDIT: It's also worth mentioning that there's an important mental distinction between a "1.0-level" maintenance mode crate and a pre-1.0 maintenance mode crate, as the former at least implies that somebody considers the crate to be marginally usable and feature-complete. glob is a pre-1.0 crate, so even though patches are accepted, I wouldn't necessarily consider it to be well-vetted without doing more research.
This makes sense, but seems unlikely to be what most people hear when you say something is in maintenance mode. In particular, "keep the lights on" is a death sentence for most projects.
I was originally sort of hoping that the OP had written some code to implement globbing based on the regex crate, because that sounds like it would be interesting. :) (And because I forgot that the glob crate existed, frankly.)
There's another way for glob() implementations to mitigate this sort of pattern that Russ doesn't discuss, but it can be inferred from a careful reading of the different examples in this new glob() article and the 2007 regex article.
In the regex article he notes that e.g. Perl is subject to pathological behavior when you match a?^na^n against a^n:
    $ time perl -wE 'my $l = shift; my $str = "a" x $l; my $rx = "a?" x $l . $str; $str =~ /${rx}/' 28
    real    0m13.278s
However, changing the pattern to /${rx}b/ makes it execute almost instantly. This is because the matcher looks ahead for fixed non-pattern strings in the pattern and deduces that, whatever we're trying to match, it can't possibly succeed if the string doesn't contain a "b".
I wonder if any globbing implementations take advantage of that class of optimization, and if there are any cases where Russ's suggested solution of not backtracking produces different results than you'd get by backtracking, in particular with some of the extended non-POSIX glob syntax out there.
That's not globbing but using the regex matcher. Perl's glob does the same as PHP's: it simply calls the system glob.
So I'm curious why the graph displays it as exponential on Linux, where it should be linear.
In your example pcre2 performs much better than Perl, btw: it errors out with "match limit exceeded" (-47), while Perl happily burns exponential CPU. It's now even worse than before Russ's original Perl article: it now recurses into heap memory, whereas before it only recursed into the stack. So now it will keep crunching until it has consumed the whole heap, while before the perl 5.10 rewrite triggered by Russ it died quickly on a stack overflow.
Empirically, Perl's glob does not call the system glob. On my Linux system Perl (5.18.2) is slow but the system glob is fast. Create a file named /tmp/glob/$(perl -e 'print "a"x100') in an otherwise empty /tmp/glob and then try:
You're the second person to claim that Perl calls the system glob though (someone in my blog comments did too). Maybe different versions of Perl do different things? This is Ubuntu 14.04 if that matters.
It uses the wrong system glob :)
It either calls out on VMS or Windows to a perlglob executable, or uses BSD glob from a shared lib.
So whenever perl makes a choice it makes the wrong one.
> That's not globbing but using the regex matcher.
Indeed. I'm just using the regex behavior to show that Perl's regex matcher uses a general optimization to defeat patterns like the ones Russ is discussing, via a strategy orthogonal to how backtracking is handled.
But Russ doesn't talk about using this strategy to optimize glob(), so it's worth pointing out that it could just as well be used in a glob() implementation.
I.e. just try to find a fixed part later in the pattern, and if it can't be found fail the entire match.
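A sketch of that idea (my own, not taken from any real glob implementation): extract the fixed literal chunks of the pattern and require each to appear in the text, in order, before running the real matcher:

```python
import re

def literal_chunks(pattern):
    # The fixed pieces of a glob: everything that isn't '*' or '?'.
    # ('?' consumes exactly one character, so it also breaks a literal run.)
    return [c for c in re.split(r"[*?]+", pattern) if c]

def could_match(pattern, text):
    # Cheap necessary condition: every literal chunk must occur in the
    # text, in order. If this fails, the full matcher can't succeed,
    # so we can reject without doing any pattern matching at all.
    pos = 0
    for chunk in literal_chunks(pattern):
        i = text.find(chunk, pos)
        if i == -1:
            return False
        pos = i + len(chunk)
    return True
```

Note this is necessary but not sufficient (it ignores anchoring and the exact character counts that `?` imposes), so a real implementation would still run the ordinary matcher when the prefilter passes.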
> I'm curious why the graph displays [perl]
> as exponential on linux, where it should be linear.
> pcre2 performs much better than perl btw: it
> errors with match limit exceeded (-47)
The pcre2 library will still perform better if you adjust the match limit so that the pattern actually matches, e.g. with --match-limit=1000000000 to pcre2grep. That finishes in around 1s with PCRE, 10s with Perl.
But in general a regex library can't be said to perform better just because it errors out with a match limit error sooner. That just means it's compiled with different defaults.
Scanning for fixed non-pattern strings would help this example. It wouldn't help the general case (which is to say, it wouldn't help always), whereas a better algorithm would.
I copied the glob implementation from one of the BSDs - I believe FreeBSD - into a standalone C program and ran that program on the same Linux system as the rest of the tests. Here's the version that tests the system glob. If you run it on your FooBSD systems you can see whether it runs quickly or not. The program assumes that you've already done:
    #include <stdio.h>
    #include <glob.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdlib.h>
    #include <dirent.h>
    #include <time.h>

    int
    main(void)
    {
        glob_t g;
        char pattern[1000], *p;
        struct timespec t, t2;
        double dt;
        int i, j, k;

        chdir("/tmp/glob");
        setlinebuf(stdout);
        int mul = 1000;
        for(i = 0; i < 100; i++) {
            p = pattern;
            for(k = 0; k < i; k++) {
                *p++ = 'a';
                *p++ = '*';
            }
            *p++ = 'b';
            *p = '\0';
            printf("# %d %s\n", i, pattern);
            clock_gettime(CLOCK_REALTIME, &t);
            for(j = 0; j < mul; j++) {
                memset(&g, 0, sizeof g);
                if(glob(pattern, 0, 0, &g) != GLOB_NOMATCH) {
                    fprintf(stderr, "error: found matches\n");
                    exit(2);
                }
                globfree(&g);
            }
            clock_gettime(CLOCK_REALTIME, &t2);
            t2.tv_sec -= t.tv_sec;
            t2.tv_nsec -= t.tv_nsec;
            dt = t2.tv_sec + (double)t2.tv_nsec/1e9;
            printf("%d %.9f\n", i, dt/mul);
            fflush(stdout);
            if(dt/mul >= 0.0001)
                mul = 1;
            if(i >= 8 && dt/mul >= 10)
                break;
        }
    }
I won't be filing any specific bugs myself. I mailed oss-security@ this morning, which should reach relevant BSD maintainers, but more bug filing can't hurt.
Slightly off-topic, but anyone know what he's using to generate those inline SVG graphs? I've been looking for some easy to use graphing library like that to present similar performance numbers on a webpage.
I looked around but didn't find anything I liked. The regexp article [1] uses jgraph [2] to plot the data as eps, then ghostscript to turn eps to png. That no longer works in a world of variable-dpi screens, so for this article I generated SVGs and inlined them into the HTML. The SVGs are generated by a custom program I wrote for the job [3], to mimic the old jgraph graphs. It's not pretty code, but it gave me a lot of control over the presentation and produced decent results. You're welcome to adapt it if you like.
Not sure if OP is author, but if you are, just to inform you, there is a small typo in this paragraph:
"Unfortunately, none of tehse protections address the cost of matching a single path element of a single file name. In 2005, CVE-2005-0256 was issued for a DoS vulnerability in WU-FTPD 2.6.2, because it ran for a very long time finding even a single match during:"
I wonder whether it would help to match from both sides (start and end) simultaneously, since you know you're not looking in the middle of the string. You also don't care about capture groups.
Reversing a regular expression and reversing a glob are about the same: you just flip everything around and you're done. In fact RE2 uses this fact to speed up regexp searches. See https://swtch.com/~rsc/regexp/regexp3.html#submatch and scan ahead a bit for "Run the DFA backward".
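For simple globs (just literals, `*`, and `?`) the flip really is plain string reversal. A quick check using Python's `fnmatch` (my own illustration, not RE2's actual mechanism):

```python
import fnmatch

def match_forward(pattern, text):
    return fnmatch.fnmatchcase(text, pattern)

def match_reversed(pattern, text):
    # Reverse both the glob and the text: for patterns built only from
    # literals, '*', and '?' the answer is unchanged. (Character classes
    # like [a-z] would need their bracket contents handled specially
    # when flipping, so this sketch ignores them.)
    return fnmatch.fnmatchcase(text[::-1], pattern[::-1])

for pat, txt in [("a*bx*cy*d", "a.bx.cy.d"), ("?a?b", "xayb"), ("a*b", "ac")]:
    assert match_forward(pat, txt) == match_reversed(pat, txt)
```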
> Graphical FTP clients typically use the MLST and MLSD commands
Do not count WWW browsers amongst the number of those graphical FTP clients. The common WWW browsers that speak FTP use LIST or LIST -l . With the exception of Google Chrome when it thinks that it is talking to a VMS program, they do not pass pattern arguments, though.
I tested Common Lisp. SBCL seems to be exponential while Clozure CL is not.
However, it should be noted that globbing is non-portable in Common Lisp, so I expect most users implement it using something like CL-FAD or OSICAT together with CL-PPCRE, and CL-PPCRE is efficient.
I've been playing around with my own glob implementation. From what I've seen, the simplified algorithm mentioned in the article wouldn't be able to handle question marks. In particular, I don't think a non-backtracking algorithm can handle a pattern like "?a?a?a?a?b". I've been working to minimize the worst-case behavior, but it's tricky.
Your semantics for question marks are wrong. Question marks matching zero or one characters are the semantics of IBM/Microsoft command interpreters, for reasons that go back all the way to CP/M. (Strictly speaking, the original semantics amounted to, because of the 8.3 field padding, question marks matching a character, the end of the string, or zero characters if immediately before the dot.)
In POSIX, question marks match exactly one character, always. There's no need for backtracking.
HN screwed up my comment. It's not supposed to be "question mark, a, question mark, a, ...". I meant to write "question mark, a, star, question mark, a, star, ...".
That would be backtracking caused by those stars, not by the question marks. Hacker News didn't screw up the part where you wrote "wouldn't be able to handle question marks".
A pattern like "star a star a star a star" can be handled by the algorithm described in the article:
> Consider the pattern "a star bx star cy star d". If we end the first star at the first bx, we have the rest of the name to find the cy and then the d. Using any later bx can only remove choices for cy and d; it cannot lead to a successful match that using the first bx missed. So we should implement a star bx star cy star d without any second-guessing, as “find a leading a, then find the earliest bx after that, then find the earliest cy after that, then find a trailing d, or else give up.”
This algorithm doesn't work if the pattern has question marks.
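Here's a sketch (my own) of the quoted first-occurrence strategy; it handles patterns built only from literals and stars, with no backtracking:

```python
def match_stars(pattern, text):
    # Pattern may contain only literal characters and '*'.
    # Strategy from the quoted paragraph: anchor the leading literal,
    # anchor the trailing literal, then take the earliest occurrence
    # of each middle chunk, left to right.
    parts = pattern.split("*")
    if len(parts) == 1:              # no stars at all: exact match
        return pattern == text
    head, *middle, tail = parts
    if not text.startswith(head):
        return False
    if not text[len(head):].endswith(tail):
        return False
    pos, end = len(head), len(text) - len(tail)
    for chunk in middle:
        i = text.find(chunk, pos, end)   # earliest occurrence is safe
        if i == -1:
            return False
        pos = i + len(chunk)
    return True
```

A `?` breaks the "earliest occurrence is always safe" reasoning because it pins an exact character count around a chunk, so the earliest occurrence of a chunk may no longer lead to a match.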
In addition to what JdeBP said about the semantics of question marks, there's always the option of using a proper automata translation (or translation to a regexp with a suitable underlying engine). Neither of those require backtracking.
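A sketch of the automata approach (my own, in the spirit of the article): simulate the set of reachable pattern positions, one text character at a time. Both `?` and `*` fall out naturally, and nothing ever backtracks:

```python
def glob_match(pattern, text):
    # Track every pattern position reachable so far (a set of states).
    # Worst case O(len(pattern) * len(text)); never exponential.

    def eps(states):
        # A '*' can match the empty string: from a '*' position we can
        # also be at the position just past it.
        states = set(states)
        stack = list(states)
        while stack:
            s = stack.pop()
            if s < len(pattern) and pattern[s] == "*" and s + 1 not in states:
                states.add(s + 1)
                stack.append(s + 1)
        return states

    states = eps({0})
    for ch in text:
        nxt = set()
        for s in states:
            if s == len(pattern):
                continue
            if pattern[s] == "*":
                nxt.add(s)            # '*' consumes ch and stays put
            elif pattern[s] == "?" or pattern[s] == ch:
                nxt.add(s + 1)        # '?' or a matching literal
        states = eps(nxt)
    return len(pattern) in states
```

Patterns like "?a*?a*b" from the discussion above are handled in a single left-to-right pass.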
The size of the pattern is held constant. Or more accurately, "it's linear" is shorthand for "it's linear in the size of the text searched."
We independently reinvented an adaptation of this algorithm for Monte's "simple" quasiliteral, which does simple string interpolation and matching. The code at https://github.com/monte-language/typhon/blob/master/mast/pr... is somewhat similar in appearance and structure to the examples in the post.
    def name := "Hackernews"
    # greeting == "Hi Hackernews!"
    def greeting := `Hi $name!`

    # language == "Lojban"
    def `@language is awesome` := "Lojban is awesome"
A quirk of our presentation is that adjacent zero-or-more patterns degenerate, with each subsequent pattern matching the empty string. This mirrors the observation in the post that some systems can coalesce adjacent stars without changing the semantics:
    # one == "", two == "cool"
    def `adjacent @one@two patterns` := "adjacent cool patterns"
False positives suck. There are people coming on here day after day, posting valuable or at least worthy comments, and nobody can see them. They waste their time unknowingly. It is cruel.
Please don't insinuate that someone hasn't read an article. "Did you even read the article? It mentions that" can be shortened to "The article mentions that."
I took it more as the bulk of the article explaining the differences and problems in algorithms. Just knowing that you can use a non-backtracking regex engine is fine, of course. Knowing how folks got it wrong is pedagogical in its own right.
As an easy additional point to the lesson, it might not be clear to everyone why you have to be careful about which regex engine you move to.
Python's glob() (fnmatch really) translates the glob to a regular expression then uses its re library:
https://github.com/python-git/python/blob/715a6e5035bb21ac49...
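The translation is easy to see from the interpreter (the exact spelling of the regex source varies across Python versions, so treat it as illustrative):

```python
import fnmatch
import re

# fnmatch.translate turns a shell glob into a regex source string:
# '*' becomes '.*', '?' becomes '.', and the whole thing is anchored.
src = fnmatch.translate("a*b?c")
print(src)  # exact form varies by Python version

r = re.compile(src)
print(bool(r.match("aXXbYc")))  # True
print(bool(r.match("ab")))      # False
```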