As someone who has crafted thousands of complex regular expression rules for data capture, here is my take:
1. This is a fine idea to aid regex newbies in crafting their expressions. I see it as a gateway rather than a long-term tool. The expressions won't be optimal (through no fault of the tool), nor will they likely be complete, but that's not the point. If it helps reduce the barrier(s) to adoption of regular expressions, then I can heartily support it.
2. To the people who say they use regular expressions only a handful of times a year, and thus that it's not worthwhile to invest time in learning the syntax, I offer this: once you know it, you will use it far more often than you ever expected. Find & replace in text, piping output, nginx.conf editing, or even the REGEXP() function in MySQL. It's a valuable skill set in so many environments that I expect you'll end up using it weekly, if not daily.
3. Ultimately regular expressions, like everything, are extra difficult until you know all of the available tools in the toolbox. At that point, you may realize you wrote an unnecessarily complex expression simply because you didn't know better.
There is another benefit to using regular expressions for replacements that is not obvious, but it is a huge productivity boost: when I decide to modify code using regexes instead of doing it by hand, my work becomes more transactional, and as a result I miss fewer errors when I realize that I need to redo it.
For example, let's say I have a bunch of function calls that need an extra parameter passed at the end. In vim, I might do something like
114,155s/);/, extra_parameter);/g
Of course, it might have been faster to copy and paste a bunch of times, but then I realize that I actually need two parameters, not one. Now I can just press 'u' to undo what I just did, then hit Escape, ':' and the up arrow to recall the last command, which I can quickly modify.
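For instance (second_parameter is just a placeholder name I'm inventing for the example), the recalled command only needs a small edit before hitting Enter again:
114,155s/);/, extra_parameter, second_parameter);/g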
When you attempt to do and undo this stuff by hand, the chances are very high that you will make subtle mistakes somewhere. In my experience, when I use regexes it is more likely that everything is right or everything is wrong in an obvious way.
Re #2: As someone who more or less lives in the Linux command line, I concur with this. Once you know regexes well, you can find productive, time-saving uses for them dozens of times a day. I rarely go a couple of days without using a regex, never mind a couple of times a year.
Ah, for Nginx specifically, I meant location blocks. Granted, regex should be used very carefully there for performance reasons, but when appropriate you can do some really cool pattern matching.
Our tool[0] for using persuasion principles on your site to increase conversion had a UX problem during setup. We wanted a generic way to detect what type of page a given URL is. The most obvious way was to go with regular expressions (/.\-.\-\d+\.html for product pages, for example).
It turned out this was by far the most misunderstood setting, while it was one of the most important ones. The target audience (marketers) had something to do with it, but even though Google Analytics and Google Tag Manager are widely used by them, setting up these expressions is really hard.
We decided to build an internal tool that generates a regular expression from examples the regex must match. We called it the regexhelper. It was so successful that we turned it into an external tool[1].
It's not perfect (in terms of generating the most efficient regexes), but it works fantastically for our audience of marketers. We're planning to open source it as well!
A visual UI along the lines of this idea could be beneficial when dealing with the regexes our helper generates.
If you want readable regexps, just use combinators and your language's variable declaration facilities. No need for more.
I don't understand why people still insist on using insane syntax for regexps instead of just... functions (`rep` for repetition, `seq` for sequences, `opt` for optional, and so on).
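A minimal sketch of what that can look like; the helpers here (seq, alt, rep, opt, lit) are hypothetical names, not from any particular library, and Python is used only for illustration:

import re

# Hypothetical combinator helpers that just build up a regex string.
def seq(*parts):                 # concatenation
    return "".join(parts)

def alt(*parts):                 # alternation
    return "(?:" + "|".join(parts) + ")"

def rep(part, low=0, high=""):   # repetition, e.g. rep(x, 1) means "one or more"
    return "(?:" + part + "){" + str(low) + "," + str(high) + "}"

def opt(part):                   # optional
    return "(?:" + part + ")?"

def lit(text):                   # literal text, escaped for you
    return re.escape(text)

# Hyphen-separated digit groups, e.g. a date like 2015-06-01
digits = rep(r"\d", 1)
date = seq(digits, lit("-"), digits, lit("-"), digits)
print(re.fullmatch(date, "2015-06-01") is not None)  # True

The point is that naming, composition, and escaping all come for free from the host language.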
Regexps can be documented, split onto multiple lines, and commented in many languages, be it through string concatenation or formatting modifiers. I write some complicated regular expressions, and I've found that splitting groups of expressions onto multiple lines and indenting them handles most of the problems that my coworkers have with grokking them, and that I have when returning to them.
I prefer the concise syntax (provided it's reasonably formatted) for the same reason that I prefer the concise syntax of sed, ed, and similar: it's easier to mentally map and reason about symbols than about large blocks of text. I've been programming for nearly 20 years, and I find I much prefer manipulating mathematical expressions to manipulating large chunks of code, because the notation is concise and the symbols mean something. I love such a notation. (In the case of programs, when refactoring, my mind works in blocks of code as units, as I'm sure most others' do.)
I'm not saying those benefits aren't possible with verbose code---they are. But just as many prefer a concise mathematical syntax to a verbose program that does the same thing, I prefer a concise formal definition.
I'm also not implying that you should try to write an entire grammar in a single regular expression.
Concise notations are great, and IMO this is why regexps are so widely used. I am, by the way, a fan of sed, which is clever enough to let you choose the delimiter (s+/+_+g).
On the other hand, there are so many additions to the core formal language, like backtracking or Larry Wall knows what, that the syntax has become cryptic. Besides, building regexps out of smaller ones is generally a pain with strings, because you need to quote special regex characters, along with any character that might interfere with the host language's syntax (e.g. emacs regexes with four backslashes in a row). I prefer to read actual words, so the following is fine for me:
After the recent discussions about Lisp, here is an actual example: CL-PPCRE can scan for "\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b" written as a parse tree. The list structure allows you to compose your regular expression like any other list, with intermediate functions, etc., without ever having to think about escaping your characters. When you need the string-based, concise regex, you wrap it in a ":regex" form and you have the best of both worlds.
> On the other hand, there are so many additions to the core formal language, like backtracking or Larry Wall knows what, that the syntax has become cryptic.
I'm not going to argue with that, though languages like Perl (which have extended regular expressions so much that they're not actually regular expressions anymore) also allow named groups, for example.
I'm not arguing that certain circumstances aren't difficult to understand. In such cases, I do actually compose expressions from separate ones (e.g. variables): but they're still a concise syntax.
> (e.g. emacs regexes with four backslashes in a row).
Yes, such cases are unfortunate and confusing.
> I prefer to read actual words, so the following is fine for me:
For me, that took much longer to parse mentally than the equivalent regex. Or, if you're okay with minor changes and a proper locale:
/\b[\w.%+-]+@[\w\d.-]+\.[A-Z]{2,}\b/
I suspect that someone used to reading the notation you provided would have opposite results than I do. The reason I find the actual formal notation for the regex easier is because there's less to keep in memory---all the verbose extras that I have to strip out when forming my mental image of the regex.
If the regular expression were more complicated, the solution you presented might not be so bad. I would normally format it like this (if we stick with the verbose character classes):
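Something along these lines, shown here with Python's re.VERBOSE as one concrete way to lay it out (the pattern itself is the one from the comment upthread):

import re

EMAIL = re.compile(r"""
    \b
    [A-Z0-9._%+-]+      # local part
    @
    [A-Z0-9.-]+         # domain
    \.
    [A-Z]{2,}           # TLD
    \b
""", re.VERBOSE | re.IGNORECASE)

print(bool(EMAIL.search("mail me at user@example.com")))  # True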
Edit: real-life example, a minimal lexer used to split compound CSS selectors on commas. We must skip commas in strings, comments and in `:not(a, b)` pseudo-classes, so `.split(',')` doesn't cut it.
// selectorTokenizer is assumed here: per the discussion below, it is roughly a
// global regex matching parens, commas, quoted strings and comments.
var selectorTokenizer = /[(),]|"(?:\\.|[^"\n])*"|'(?:\\.|[^'\n])*'|\/\*[\s\S]*?\*\//g

function splitSelector(selector) {
  var indices = [], res = [], inParen = 0, match
  // Record the indices of top-level commas; strings and comments are consumed
  // whole by the tokenizer, so commas inside them never show up here.
  while (match = selectorTokenizer.exec(selector)) {
    switch (match[0]) {
    case '(': inParen++; break
    case ')': inParen--; break
    case ',': if (inParen) break; indices.push(match.index)
    }
  }
  // Cut the selector at the recorded indices, right to left.
  for (var i = indices.length; i--;) {
    res.unshift(selector.slice(indices[i] + 1))
    selector = selector.slice(0, indices[i])
  }
  res.unshift(selector)
  return res
}
// e.g. splitSelector('a:not(.x, .y), b') => ['a:not(.x, .y)', ' b']
I appreciate your library, that's pretty cool. I just want to point out that some (most?) regex libraries support a whitespace-insensitive mode, which allows you to write out the raw regex in a way that's considerably easier for humans to visually grok:
"[(),]" // match the foo part...
+ "|\"(?:\\.|[^\"\n])*\"" //... or, match the bar part in *double* quotes, putting quoted value in capture group 1...
+ "|'(?:\\.|[^'\\n])*'" //... or, match the bar part in *single* quotes, putting quoted value in capture group 1...
+ "|\/\*[\s\S]*?\*\" //... or, match whatever the hell that is.
Simple string splitting + commenting the semantic parts. Also, labeling the capture group (and creating named constants for them in your code next to your regex) is a huge win.
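For anyone who hasn't used it: in Python's flavour the mode is re.VERBOSE (Perl/PCRE spell it (?x), Java has Pattern.COMMENTS), and combined with a named group it looks roughly like this; the group name 'dq' and the toy input are my own:

import re

# Whitespace-insensitive ("verbose") mode plus a named capture group.
TOKEN = re.compile(r"""
      [(),]                                  # punctuation we care about
    | " (?P<dq> (?: \\. | [^"\n] )* ) "      # double-quoted string, content in group 'dq'
    | ' (?: \\. | [^'\n] )* '                # single-quoted string
    | /\* [\s\S]*? \*/                       # comment
""", re.VERBOSE)

m = TOKEN.search(r'"a\"b"')
print(m.group("dq"))   # a\"b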
If the string you happen to match contains a lot of metacharacters, you end up with backslashes all over the place, which makes the result hard to read. Nested groups and captures are also often hard to parse.
FWIW, you forgot to double escape `\\\\.`, and didn't close the CSS comment (last alternative).
"[(),]" // match the foo part...
+ "|\"(?:\\\\.|[^\"\\n])*\"" //... or, match the bar part in *double* quotes
+ "|'(?:\\\\.|[^'\\n])*'" //... or, match the bar part in *single* quotes
+ "|\/\*[\s\S]*?\*\/" // or match the comment
Also, you're probably not familiar with the quirks of JS regexps, but the two string alternatives use non-capturing groups, and `[\s\S]` is the true "any" matcher, since `.` doesn't match newlines. Lastly,
`*?` is a non-greedy `*`. (Edited thrice, damn you italics).
I admit I just copied your example and tried to sort of convert it into Java style. It definitely won't be a correct Java-compatible regex. As you said, I'm not familiar with the JS regex quirks. I just couldn't invent a good example on the spot, and didn't want to post ones from the code I work on at my day job, for legal reasons.
And yeah, I agree about the "lots of backslashes" part. It gets messy, but splitting regexps into parts at least makes it more manageable. I'm not yet angry enough at the cases I have at my day job to whip up a DSL for it, though.
I do a lot of text processing in Lua, primarily because of LPeg. It can parse text that regular expressions can't handle (or have real trouble with, like http://www.ex-parrot.com/pdw/Mail-RFC822-Address.html), it can transform the data on the fly (e.g. convert digit characters into their numeric value), and, more importantly, LPeg expressions are composable. Have an LPeg expression that can parse an email address? You can then plop it into a larger LPeg expression to parse, say, a header line.
Yeah, PEGs are pretty sweet. Perl6 is implementing a similar system (at least capability-wise) in the form of "Grammars": http://doc.perl6.org/language/grammars. They're what regexps should've been all along, IMO.
The problem with "regex" is that it left the pure computer science realm of true regular expressions, and thus lost many of the mathematical properties of regular expressions.
Regexes are then further abused to do things far beyond what true regular expressions can do, which results in cryptic expressions whose behavior is implementation-dependent instead of bounded by computer-science principles.
Lua creator Roberto Ierusalimschy resurfaced and explored the idea of PEGs (Parsing Expression Grammars) as a better way to do the things people have abused regexes to do, while keeping it grounded in pure CS principles: a cleaner syntax that makes things easier to express, more powerful behavior, mathematically grounded complexity (for performance), and more clarity about what can and cannot be accomplished.
This video presentation from the Lua Workshop explains all of this and more about why PEGs.
https://vimeo.com/1485123
The problem with grammars is that they are too verbose.
Regexes are concise. This is their strength and weakness.
I can write a regex in a dialog box after initiating a "Find". I can't specify a grammar like that.
The deeper problem with regexes is programmers. Most programmers do not have the perspective to say "Whoa. This regex is too much. I'm really doing parsing at this point and should probably switch up to a grammar."
But that isn't even completely true, and you miss my point about regexes falling outside the domain of "real" regular expressions. The Perl regex to validate email addresses according to RFC 822 (the monster linked upthread) is enormous, and it doesn't actually handle everything. Here's the LPeg equivalent:
local P = lpeg.P  -- match a literal string
local R = lpeg.R  -- match a character range
local S = lpeg.S  -- match a set of characters
local V = lpeg.V  -- reference a rule inside a grammar
local C = lpeg.C  -- capture the matched text
local CHAR = R"\0\127"
local SPACE = S"\40\32"
local CTL = R"\0\31" + P"\127"
local specials = S[=[()<>@,;:\".[]]=]
local atom = (CHAR-specials-SPACE-CTL)^1
local dtext = CHAR - S"[]\\\13"
local qtext = CHAR - S'"\\\13'
local quoted_pair = "\\" * CHAR
local domain_literal = P"[" * ( dtext + quoted_pair )^0 * P"]"
local quoted_string = P'"' * ( qtext + quoted_pair )^0 * P'"'
local word = atom + quoted_string
-- Implements an email "addr-spec" according to RFC822; a successful match captures the domain
local email = P {
V"addr_spec" ;
addr_spec = V"local_part" * P"@" * C(V"domain") ;
local_part = word * ( P"." * word )^0 ;
domain = V"sub_domain" * ( P"." * V"sub_domain" )^0 ;
sub_domain = V"domain_ref" + domain_literal ;
domain_ref = atom ;
}
If you stay in the realm of "real" (theoretical/CS) regular expressions, then regexes don't have to be nasty. But the fact is that most people are not doing this; they are trying to do things way outside that domain. At that point, all bets are off and other tools may be more correct, more appropriate, and more concise.
Perl 6 Rules unify PEGs, regexes, and closures -- "[a] rule used in this way is actually identical to the invocation of a subroutine with the extra semantics and side-effects of pattern matching (e.g., rule invocations can be backtracked)."
First, referencing that regex is somewhat misleading. It's generally the end result of building the constituent parts with Perl's qr// regex-object constructor and then combining them into a larger regex[1]. Second, if you want to use a grammar, there's a way to do that in the more complex regex engines: you can use named captures to create a grammar within a regex, and then run it[2].
First, "validating" an email address is pointless. This is why I can't use things like "+" or UTF-8 in an email address. You send to the email address and see if you get an error. The only people who should be parsing email addresses are mail programs.
And, second, RFC822 actually specifies a grammar so why would you expect a regex to do the job?
I can see the claimed advantages to what's proposed, but I feel like if the railroad diagram by RegExper could be reversed, that would be a far more successful visual syntax for regular expressions. Then again, most of my regex-fu entails building a regex relatively close to what I want and then repeatedly throwing it at a local instance of RegExper and test strings until I have something which accomplishes what I'm looking for it to do. I'd definitely fall outside the "true regex superheroes" category.
Anyway, to simplify what I have in mind for us less-than-experts: it'd be neat if someone could put together a railroad diagram of a regular expression that would then be compiled into the regex itself.
That being said, I don't have the presence of mind right now to determine if two different regexes can result in the same diagram in RegExper. If so, that kinda thoroughly breaks my idea.
> most of my regex-fu entails building a regex relatively close to what I want and then repeatedly throwing it at a local instance of RegExper and test strings until I have something which accomplishes what I'm looking for it to do.
> I'd definitely fall outside the "true regex superheroes" category.
I think you just gave the definition of "true regex superhero". Best regex programmers I know have a similar workflow.
I understand that the first email regex is simplified and as a result doesn't handle oddities such as weird symbols, quotations and IP addresses, but it should be able to handle modern TLDs. Not only are there names longer than 4 characters, there are also internationalised domain names, whose punycode form starts with xn--.
However you could argue that validating email via regex misses the point entirely. A simple, permissive regex is all you really need assuming you are actually sending an email to check that the account exists.
As a result I'm not into the idea of such a visualisation; you should be using regexps all the time, and internalising the rules. When that's not enough you have to go and read up. I'm not sure such a visualisation will help that much in those non-regular cases, simply because they won't always be available to hand.
Personally, I try to limit my usage as much as possible. Regexes are basically only useful for things I do in my terminal and occasionally verifying that data is well-formed. For practical parsing, it's almost always better (and runs faster) to use a more robust solution like parser combinators or lexer/parser generators. Even oftentimes for things that are seen as perfect regex use-cases (validating data or splitting strings), using a parsing solution will work better -- for example, you can't be sure that all the numbers in your data are small enough to not overflow the integers using regexes alone (or at least not without resorting to extremely long and unreadable regexes). Regexes are a tool that's easy to reach for, but a lot of the time it's a tool that will end up breaking on you eventually.
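To make the overflow point concrete, here's the kind of hybrid check I mean; this is only a sketch, and the 32-bit bound is just an example:

import re

INT32_MAX = 2**31 - 1   # example bound; use whatever your target type needs

def parse_small_int(text):
    """Reject anything that isn't a decimal integer fitting the bound."""
    if not re.fullmatch(r"-?\d+", text):   # the regex only checks the *shape*
        return None
    value = int(text)                      # the parse step checks the *range*
    return value if -INT32_MAX - 1 <= value <= INT32_MAX else None

print(parse_small_int("2147483647"))   # 2147483647
print(parse_small_int("9999999999"))   # None: shape is fine, range is not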
Stuff like this, while well intentioned, is ultimately harmful. Regex always looked like total gibberish to me; then one weekend I sat myself down and actually learnt it, and I've had no issues since. It's really simpler than it seems and worth the effort to learn; programs like this just act as a crutch.
The point of this, as the article says, is for people who use regexps rarely. If you only have to design a regex twice a year, taking a weekend to relearn it is too much trouble.
I'll be really interested to see others' reactions to this. My first impression when I glanced over the example construction was not good. I felt like it really didn't improve comprehension, but just forced me to try to learn a new way of seeing those symbols. Perhaps a visual regex "IDE" that completely abstracted the syntax would be a better approach.
Depends on what you mean by "parse". If all you want is to search a document that is known to be well-formed, find an element that meets a few criteria, and grab a value out of that element, you can sometimes get away with using regex to find a substring that "looks right" without actually parsing the document.
Running your document through an actual parser gives you access to more information about the structure of the document and the context of the elements of interest. Actually parsing your input is therefore more robust to unexpected variations than any of the superficially-cheaper alternatives that people try.
"The most interesting thing about the language is the string pattern matching capabilities. Here's an small(and very incomplete) example that extracts the parts of a simplified URL string:
LETTER = "abcdefghijklmnopqrstuvwxyz"
LETTERORDOT = "." LETTER
LETTERORSLASH = "/" LETTER
LINE = INPUT
LINE SPAN(LETTER) . PROTO "://" SPAN(LETTERORDOT) . HOST "/" SPAN(LETTERORSLASH) . RES
OUTPUT = PROTO
OUTPUT = HOST
OUTPUT = RES
END
In the pattern-match line, the contents of the LINE variable are matched against a pattern. The pattern contains the following elements:
1. The SPAN(LETTER) . PROTO "://" section says: identify a sequence of letters followed by "://" and assign them to the variable called PROTO.
2. The SPAN(LETTERORDOT) . HOST "/" section says: take a sequence of letters and dots followed by "/" and assign them to the variable called HOST.
3. Finally, the last section takes the remaining letters and slash characters and assigns them to the RES variable."
Any time I want to write a non-trivial regex, I use https://debuggex.com/ to write and check it.
It's also great for quickly finding out what a regular expression someone else wrote actually does.
Same for me. Funny, as Friedl barely discusses the ERE and awk (DFA) implementations. It's a shame, as the "more interesting" NFA implementations have performance issues under some circumstances. I had hoped the dilemma and the engineering trade-off between the features of NFA engines and the guaranteed performance of DFA engines would get a more realistic discussion.
For a discussion of the "DFA" engines, I'd recommend Russ Cox's article series.
The NFA implementations you speak of have problems because they use backtracking and take worst case exponential time. Most implementations are in fact not NFAs, since for example, an NFA is not a powerful enough tool to resolve backreferences (which are NP-complete).
Both NFAs and DFAs in fact have equivalent computational power, and either can be used to perform regular expression matching in linear time. There are of course lots of performance differences in practice.
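If anyone wants to see the exponential behaviour first-hand, the classic toy case against a backtracking engine (Python's re module here) looks roughly like this; the lengths are picked to keep the slowest run in the seconds range:

import re
import time

# Nested repetition that fails on a string like "aaaa...b": a backtracking
# engine tries every way of splitting the a's between the inner and outer +,
# which roughly doubles the work for every extra character.
pattern = re.compile(r"(a+)+$")

for n in (18, 20, 22, 24):
    start = time.perf_counter()
    pattern.match("a" * n + "b")        # never matches
    print(n, round(time.perf_counter() - start, 3), "s")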
Friedl's book is great for what it is: a guide to optimizing regexes that search via backtracking.