Hacker News new | past | comments | ask | show | jobs | submit login

> Nonetheless, every popular programming language (except Go[1])

... and Rust. Although Rust doesn't have a regex library in std, the official regex crate (I am its author) descends from RE2[1].

> So it's feasible for the engine to optimistically try to use an efficient implementation

Feasible, sure. But not easy or simple. If Python's re2 module uses RE2 by default and Python's regex engine when necessary, then it's very likely that the match offsets it reports are not coherent. RE2 aims to be mostly consistent with things like PCRE, but there are cases where it isn't.

The problem is, building a production grade FSM regex engine or a production grade backtracking regex engine are both huge engineering projects. And there are few, if any, people that know how to do both. They both have their own little specialties and there isn't a ton of crossover among implementors of regex engines.

It's one of those ideas that sounds so stupidly obvious in practice, but when you actually sit down to engineer the thing, it's quite a bit harder than you might imagine.

> (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point).

This kind of ignores the fact that regexes are often the interface to something else. If you have a full blown programming language at your fingertips, it's usually not hard to work around the absence of look-around or backreferences by, say, using a second regex or something. But when all you have available to you is, say, a blacklist and you get to provide the regexes to that blacklist, then a more expressive regex syntax becomes a lot more useful. See for example people already running into problems with Discord's new feature: https://github.com/rust-lang/regex/issues/127#issuecomment-1...

Of course, as I noted, it would be very inappropriate for Discord to permit non-regular regexes here. It's borderline already by permitting any regexes at all, because even FSMs have their pathological cases, albeit, not exponential. But if it's a tool you run on your local system? Sure, look-arounds become much more useful.

To be clear, I am pro-FSM and agree with you that they should really be the default regex engine everywhere. But instead, reality is indeed inverted, with backtracking being the default (almost) everywhere. But I wanted to pipe up about a few points that I think you might have not quite captured precisely.

[1]: https://github.com/rust-lang/regex/




Thanks! All good clarifications. You're definitely right on all counts, and it was more than a bit naive of me to suggest that language/library authors "simply" have two production grade regex implementations :)

>> (And if you are using them, then I'd argue you're shoving a bit too much logic into a regex, but that's beside the point).

> This kind of ignores the fact that regexes are often the interface to something else.

I will definitely admit to using the occasional assertion or backreference in grep or sed from time to time, so you make a good point about local tools too. I suppose I was thinking more in terms of programming APIs when I said this, and so keeping the standard library FSM-based would still apply here. I'd imagine that special-purpose text munging tools which care that much, will likely have their own engines anyway.

Thanks also for your work on the regex crate. I'm still dipping my toes into Rust but I'm glad to know there's an FSM-based default regex library!


Indeed. It doesn't take long before look-around inside of a grep command ends up being quite convenient, sometimes even saving yourself from a much longer shell pipeline. It's why I ultimately relented and added -P/--pcre2 to ripgrep. (Its default regex engine is the Rust regex crate.)


> > Nonetheless, every popular programming language (except Go[1])

> ... and Rust. Although Rust doesn't have a regex library in std, the official regex crate (I am its author) descends from RE2[1].

... and Tcl (see my earlier post). You might argue that Tcl is not popular but it still has a significant following.


I'm not a Tcl expert, but now that I think about it, I'm not sure you're correct. My recollection is that Tcl is one of the few to have a true hybrid engine. So if you use look-around or backreferences, then Tcl could provide an exponential runtime. If that's true, then Tcl would be included in "Nonetheless, every popular programming language (except Go[1]) has included an exponential backtracking regex implementation in their standard library," with perhaps a caveat or special mention. But it's still a categorically different guarantee than what, say, RE2 provides.


... sure. I don't personally consider Tcl to be popular at present though.




Consider applying for YC's W25 batch! Applications are open till Nov 12.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: