
I have not looked at the other linear-time implementations to see what they do, but I expect they all use one of these two approaches.

Python's glob() (fnmatch really) translates the glob to a regular expression then uses its re library:


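For illustration, the standard library's fnmatch.translate does exactly this translation; a quick demonstration (the exact regex text it produces varies across Python versions):

```python
import fnmatch
import re

# Translate a shell-style glob into a regular expression string.
pattern = fnmatch.translate('*.txt')

# The precise form of the translation differs by Python version, but
# the resulting regex accepts the same names the glob would.
assert re.match(pattern, 'notes.txt')
assert not re.match(pattern, 'notes.txt.bak')
print(pattern)
```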
Thanks for pointing this out. I wrote a Python test program but forgot to run that test in the data collection for the graphs. I will update them once the tests finish running. Ironically, Python must be passing the glob to an exponential-time regexp library, since Python is on the slow side of the fence.

Russ, while you're around, thanks for re2! RE2::Set especially! But why is RE2::Set so underdocumented and underpromoted? All regex libraries need this functionality.

You're welcome. I think basically all of RE2 is underdocumented and underpromoted. Why should RE2::Set be any different?

Seriously, though, it was a bit of an afterthought. A team at Google was |ing together a ton of regexps and came to me for something better, so I wrote RE2::Set. I'm glad it helps others too.

I think your previous article on regex engines indicated that Python's is backtracking, with poor performance.

Indeed, I just didn't want to presume that nothing had changed in the past 10 years (but apparently it hasn't).

Python's regex library compiles the regex to a sequence of opcodes and then does recursive descent, so yes, it's exponential.

Yep, Python's "re" is the usual PCRE+X story.

Python has its own engine which is not based on PCRE AFAIK:


I noticed you linked to python-git instead of the main Python repo. The whole project was moved a couple of months ago.


Yeah, that was a mistake on my part, but by the time someone pointed it out I couldn't edit my comment anymore. Sorry. :-(

Here is the same but with star and doublestar for handling slashes:

https://github.com/borgbackup/borg/blob/master/src/borg/shel... http://borgbackup.readthedocs.io/en/stable/usage.html#borg-h... ("Shell-style patterns, selector sh:")
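The key difference from plain fnmatch is that `*` stops at slashes while `**` crosses them. A minimal, hypothetical translation sketch (not borg's actual code):

```python
import re

def translate(pat):
    """Translate a glob with * and ** into a regex string (sketch).

    '*' matches within one path segment; '**' matches across slashes.
    Hypothetical helper, not the borg implementation.
    """
    i, n = 0, len(pat)
    out = []
    while i < n:
        if pat[i:i + 2] == '**':
            out.append('.*')      # ** crosses '/' boundaries
            i += 2
        elif pat[i] == '*':
            out.append('[^/]*')   # * stays inside one segment
            i += 1
        else:
            out.append(re.escape(pat[i]))
            i += 1
    return '(?s:%s)\\Z' % ''.join(out)

print(translate('src/**/*.py'))
```

Real implementations usually special-case `**/` so it can also match zero directories; this sketch omits that.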

That implementation looks pretty inefficient, creating lots of strings that are thrown away immediately. Could it not be refactored into a generator combined with str.join?
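The suggested refactor might look something like this (a hypothetical sketch of the shape, not a drop-in patch for the stdlib's fnmatch.translate):

```python
import re

def _translate_parts(pat):
    """Yield one regex fragment per glob character (hypothetical helper)."""
    for c in pat:
        if c == '*':
            yield '.*'
        elif c == '?':
            yield '.'
        else:
            yield re.escape(c)

def translate(pat):
    # A single join allocates the final string in one pass, instead of
    # building many throwaway intermediates via repeated concatenation.
    return '(?s:%s)\\Z' % ''.join(_translate_parts(pat))
```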

I think Python's string type has an optimization that concatenation on a string where the LHS only has a single reference to it will reuse the existing string when possible. You still have to grow the buffer periodically, of course, but it's not creating as much garbage as it looks like.

> I think Python's string type has an optimization that concatenation on a string where the LHS only has a single reference to it will reuse the existing string when possible.

It's specifically a CPython optimisation; it may not (and usually won't) hold on alternative implementations like PyPy or Jython.

Sure, but the function in question lives in CPython's own implementation of its core libraries, so it seems reasonable for it to rely on that implementation.

If PyPy or Jython want to reuse that core library code, it's on them to make whatever changes are needed to make it fast on those implementations.

Usual use would be "translate pattern once, compile it and then match 1000 file names against it".
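That usage pattern, translating and compiling once and then reusing the compiled object, is cheap to spell out:

```python
import fnmatch
import re

# Translate and compile the glob once...
matcher = re.compile(fnmatch.translate('*.log'))

# ...then reuse the compiled object across many names.
names = ['app.log', 'app.log.1', 'notes.txt', 'error.log']
matches = [n for n in names if matcher.match(n)]
print(matches)  # ['app.log', 'error.log']
```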

Ahh yes, it's cached in fnmatchcase. It's pretty fast as well, <10µs for a basic match.

A cache without any eviction! Probably should not expose `fnmatch` to the network where the pattern varies.

That's an old version of Python source from a mirror. It uses functools.lru_cache with a maxsize of 256 now: https://github.com/python/cpython/blob/fcfe80ec2592fed8b3941...

Gah, thanks for the proper link. I should've double-checked (and realized it, since no updates in 8 years is unlikely).

The RE engine here is only used as a fast VM, so the pathological cases from the article should still apply. Apart from the memory-use issues, there's also that.

It should be updated with a functools.lru_cache decorator I think.
