OK, these kinds of regex tools get posted quite often. I get it, regex is very confusing at first. And some of these use-cases result in rather complex expressions nobody should be forced to write from scratch (you are still remembering to write unit tests for them though, right?)
But as someone who actually knows [some flavours of] regex fairly well, what I would really like, is a reference that covers all the subtle differences between the various regex engines, along with community-managed documentation (perhaps wiki pages) of which applications & API versions use which flavour of regex.
For example, the other day I wanted to run a find on my NAS. I needed to use a regex, but the Busybox version of find doesn't support the iregex option, so all expressions are case-sensitive. With some googling, I was able to find out that the default regex type is Emacs, but I wasn't able to find either a good reference for exactly what Emacs regex does and doesn't support, nor any information about how to set the "i" flag. In the end I had to manually convert every character into a class (like [aA] for "a") which was tedious, but quicker than trying to find a better solution or resorting to grep.
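The character-class workaround described above is mechanical enough to script. A minimal Python sketch, assuming only ASCII letters need expanding and that every other character can be passed through untouched (the helper name `case_insensitive_classes` is made up for illustration, and escaping for the target engine is deliberately left to the caller):

```python
def case_insensitive_classes(literal):
    """Expand each ASCII letter into a two-character class, e.g. 'a' -> '[aA]'."""
    parts = []
    for ch in literal:
        if ch.isascii() and ch.isalpha():
            parts.append("[" + ch.lower() + ch.upper() + "]")
        else:
            # Assumption: non-letters are already valid in the target regex flavour.
            parts.append(ch)
    return "".join(parts)


print(case_insensitive_classes("jpg"))
```

The output can then be pasted into a pattern for an engine that has no case-insensitive flag.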
A related, annoyingly common pattern is that the documentation for `find` states that `-regex` specifies a regex, but it does not state which flavour of regex. The documentation for certain versions of `find`, which support alternative engines, notes that the default is Emacs. From this I was able to infer (perhaps wrongly) that the Busybox `find` uses Emacs-flavoured regex, but ultimately I still had to resort to some trial-and-error. This problem is all too common in API documentation.
Honestly, as a noob, this is one of the biggest reasons I have such a hard time deciding to learn regex.
Python flavor would probably be different than PCRE, which is probably different than JS flavor.
Even worse is that it might be too late to standardize all the regex flavors because there is already so much written in different regex flavors that it just costs too much for them to become obsolete in the future.
Honestly don't let this get you down, here's a learning plan (use regex101 to learn)
1) Learn PCRE regex.
2) Try regex golf or crosswords to learn PCRE regex.
3) Take the quiz on regex101.
Once you're done with all 3:
Learn the minor/major differences in the other languages. There aren't many. For example this named capture group:
(?<somename>someregex)
Would look like this in a different language:
(?P<somename>someregex)
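As a concrete case of that difference, here is a small sketch using Python's standard `re` module, which takes the `(?P<name>...)` spelling and rejects the `(?<name>...)` one. The sample string and the group names are invented for illustration:

```python
import re

# Python's re module spells named groups as (?P<name>...).
match = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "released 2019-07-01")
year, month = match.group("year"), match.group("month")
print(year, month)

# The (?<name>...) spelling (used by e.g. .NET and JavaScript) is a syntax error here.
try:
    re.compile(r"(?<year>\d{4})")
    accepts_other_spelling = True
except re.error:
    accepts_other_spelling = False
print(accepts_other_spelling)
```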
There are some differences in what each language can and cannot do, like recursion, because someone thought it was a great idea to make javascript awful at regex, but that's beside the point. Regex is totally worth learning.
The cheat sheets exist because people aren’t learning regex. You don’t need to learn every flavor of regex, just the one or small number you need to know. And once you know the basics, the differences are very minor.
The basic regex is easy, in fact an English word is a regex! A dot matches a single character. A star matches zero or more of the previous character. Just that is useful for a lot of cases!
If you believe it is possible to become an expert in regular expressions as they exist in modern computer languages in "a couple of hours at best" you are delusional.
The O'Reilly book "Mastering Regular Expressions" has a whole section dedicated to it. As well as several tables. But it would be nice to have an online version.
And it's one of the best O'Reilly books. I went and checked because of your comment and just noticed there was a third edition that I missed, I have the second. Still a book worth studying.
RE2 syntax[1] is a pretty good option to learn, because it's mostly a "lowest common denominator" - if it works in RE2, it should work in PCRE, Python, Javascript, etc. The reverse isn't true - there is a bunch of syntax that RE2 doesn't support by design, often to constrain performance bounds.
Emacs regexps are unfortunately their own weird beast - they handle parentheses differently than other regexp engines, because Emacs assumes that you'll be running regexps on Lisp code a lot and want to easily match parentheses. The best documentation on that syntax is (confusingly) in the Elisp reference manual: https://www.gnu.org/software/emacs/manual/html_node/elisp/Sy....
IME Emacs provides a very pleasant way to write regexps using the rx library. ELPA also has the package xr, which converts Elisp regexps to rx format, and pcre2el converts PCRE to Elisp. So a regexp like
Agreed that rx is nice, but really only useful if you're writing elisp. 90%+ of people who need to interact with Emacs regexps aren't writing an elisp program - they're using Emacs interactively, or even using another program (busybox, GNU find, etc.) that uses Emacs regexps for historical reasons. For those people the differences in syntax between Emacs regexps and "normal" regex dialects are a pain.
I tend to go to https://www.regular-expressions.info when I need to find out which features are supported between dialects. Not always up-to-date, but has some good info.
It's like SQL - everyone has a dialect. For most things where a SQL/regex engine/parser isn't the core of what they do, it will never be a priority. The best approach IMO is something like this in priority order:
1. Stick to using the lowest common denominator like you did for case insensitivity.
2. If that becomes too cumbersome, then consider whether regex is the right tool for the job. Maybe you can use e.g Python/your favorite language with a known regex standard.
3. If there are no other tools and you're stuck with whatever flavor of regex one particular thing supports, only then invest time in learning the details. There is probably a book out there with the details even if there's no webpage.
You're totally right. Right now this tool only supports the javascript flavor of regex. That said, for all the simple expressions shown there it's more or less the same for most other engines. I guess that makes it okay.
It's not so bad going between JS, Ruby and Elixir regex (possibly due to my use of a smaller set of features), but VIM regex disappoints me time after time.
if you're on osx, the app Patterns is really good for testing regex, and also has quick references for a variety of regex 'engines' and also has decent matching explanations
I use regex a lot but deliberately keep it simple.
One thing that confounded me often was positive and negative look-arounds. I always got the expressions mixed up, until I just put the expressions into a table like this...
It's not hard, but for whatever reason my brain had trouble remembering the usage because every time I looked it up, each of those expressions was nested in a paragraph of explanation, and I could not see the simple intuitive pattern.
Putting it into a simple visualization helps a lot.
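The table itself did not survive here, so the following is only a guess at the same idea: the four look-around forms side by side, tried against one sample string with Python's `re` module. The sample text and labels are assumptions, not the original poster's.

```python
import re

text = "foobar foobaz quxbar"

# "=" means "must match", "!" means "must not match"; "<" looks behind instead of ahead.
forms = [
    ("positive lookahead", r"foo(?=bar)"),    # foo, if followed by bar
    ("negative lookahead", r"foo(?!bar)"),    # foo, if NOT followed by bar
    ("positive lookbehind", r"(?<=foo)bar"),  # bar, if preceded by foo
    ("negative lookbehind", r"(?<!foo)bar"),  # bar, if NOT preceded by foo
]

# Record where the first match of each form starts in the sample text.
starts = {name: re.search(pattern, text).start() for name, pattern in forms}
for name, start in starts.items():
    print(name, start)
```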
Now, if I can find a similar mnemonic for backreferences !?
Maybe it's easier to remember that lookbehinds are evil from an implementation standpoint, and even in Perl have arbitrary limitations. If you see lookbehinds, look away! If you see lookaheads, go ahead.
Oddly, lookbehinds are evil only in a specific backtracking world. We never got around to implementing arbitrary lookarounds in Hyperscan (https://github.com/intel/hyperscan) but if we had done something in the automata world to handle lookaround, lookbehinds are way easier than lookaheads.
To handle a lookbehind, you really only need to occasionally 'AND' together some states (not an operation you would normally do in a standard NFA whether Glushkov or Thompson). To handle lookaheads... well, it gets ugly.
It's something I really like about .NET's regular expressions. Lookbehind has no limitations and will just match backwards with all features you can use in other parts.
So depending on the language or flavor you're working in, running away isn't really necessary.
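For a sense of what such a limitation looks like in practice, here is a sketch in Python, whose standard `re` module only accepts look-behinds of a fixed width. This shows Python's behaviour, not Perl's or .NET's, and the helper name `compiles` is made up:

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


fixed_width = compiles(r"(?<=abc)d")    # look-behind of known length
variable_width = compiles(r"(?<=a+)d")  # look-behind of unbounded length
print(fixed_width, variable_width)
```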
We use it on slack and irc for debugging people's regular expressions all the time. Being able to have 30 revisions to a base regex to troubleshoot is fantastic.
I'm loving the graphs which for the first time in years are giving me an idea of what an expression is actually doing. Just because the visualization is kept in a form that is easy to understand with a programming background but can also be translated to the expression itself in a straightforward manner.
Graphs for these really hammer home the point that regular expressions aren't magic. Parsers have so many abilities that when starting out, my expressions were horribly inefficient and missed many corner cases. Learning to graph them just like automata immediately made things easier.
When green devs are having trouble with regular expressions (and don't have a formal computer science background), I like to give them a crash course in DFAs.
I love regex and have no trouble reading them, but still love this tool, great job. I especially like the railroad diagrams, for those cases where I brainfarted on a regex and it's doing something other than what I intended. Thanks for this.
Something subtle, but I quite loved that the email regex is, IMHO, close to perfect: \S+@\S+\.\S+
Because the "perfect" one is just absurd, and no one realizes it's going to be so fucking absurd until they start getting support cases and then go read something like this: https://stackoverflow.com/a/201378/931209
> If you want to get fancy and pedantic, implement a complete state engine. A regular expression can only act as a rudimentary filter. The problem with regular expressions is that telling someone that their perfectly valid e-mail address is invalid (a false positive) because your regular expression can't handle it is just rude and impolite from the user's perspective.
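To make the "rudimentary filter" idea concrete, here is a short Python sketch of the loose pattern quoted above. Applying it with `fullmatch` is an assumption (the site may anchor it differently), and the helper name and sample addresses are invented:

```python
import re

# The loose pattern from the comment: something, an @, something, a dot, something.
LOOSE_EMAIL = re.compile(r"\S+@\S+\.\S+")


def looks_like_email(candidate):
    """Cheap sanity filter only; it does not prove the address can receive mail."""
    return LOOSE_EMAIL.fullmatch(candidate) is not None


for sample in ["user+tag@example.com", "no-at-sign.example.com", "has space@example.com"]:
    print(sample, looks_like_email(sample))
```

Even this loose filter turns away addresses on a dotless domain, which is part of why sending a verification mail is the more reliable check.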
Very cool! The site that worked best for me to learn regex was https://regexcrossword.com/ - after solving my way through all of them (I got really hooked when I discovered the site) I found I was alright at regex.
One thing i've always missed from the Perl programming language is the regex operators.
You could do:
my $var='foo foo bar and more bar foo!!!';
if($var=~/(foo|bar)/g){ # does the variable contain foo or bar?
    print "foo! $1 removing foo..\n";
    # remove our value..
    $var=~s/$1//g;
}
neat site! clicking an example opens up a playground with live update and explanation and railroad diagrams, similar to sites like regex101[1] and regulex[2]
one suggestion would be to mention clearly which tool/language is being used, regex has no unified standard.. based on "Cheatsheet adapted" message at the bottom, I think it is for JavaScript. I wrote a book on js regexp last year, and I have post for cheatsheet too [3]
Plug for Verbal Expressions (no affiliation), which has an alternate way of compiling more human-readable regexes for a dozen languages: http://verbalexpressions.github.io/
I remember that library. A year after I made regexpbuilder https://www.npmjs.com/package/regexpbuilder that library suddenly appeared, and was basically a rip-off of the concept I appear to have created (there was no such other library before regexpbuilder), but is also fairly useless because it doesn't look like it could represent more than about 10% of the possible regular expressions. Yet there was no mention of my library at all in the readme of verbal expressions.
Regex are quite simple and useful but my only issue is with those recursive things. Like how do you match balanced brackets? I have a regex (pcre) copy-pasted for it but for the life of me I don't get it or maybe nod my head but instantly ununderstand it. I wish there was a simple to understand doc that teaches to me how I can match something like:
"(this is inside a bracket (and this is nested or (double nested)))
P.S. I know token parsing is better for these things but still I just want to learn the other thing too.
Balanced parentheses are not a regular language, so it's theoretically impossible to match them with regular expressions.
In practice, most regexp implementations you see are more powerful than regular expressions. For instance, .NET has a balancing groups feature [0] for exactly this use case.
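For comparison, the token-parsing route mentioned in the question needs very little code. A minimal Python sketch using a stack of open-parenthesis positions; the function name `balanced_groups` is made up, and raising on unbalanced input is just one possible design choice:

```python
def balanced_groups(text):
    """Return the contents of every balanced (...) group, ordered by where it opens."""
    open_positions = []  # stack of indexes of '(' not yet closed
    groups = []          # (start index, text between the parentheses)
    for index, ch in enumerate(text):
        if ch == "(":
            open_positions.append(index)
        elif ch == ")":
            if not open_positions:
                raise ValueError("unmatched ')' at index %d" % index)
            start = open_positions.pop()
            groups.append((start, text[start + 1:index]))
    if open_positions:
        raise ValueError("unmatched '(' at index %d" % open_positions[-1])
    return [body for _, body in sorted(groups)]


sample = "(this is inside a bracket (and this is nested or (double nested)))"
for body in balanced_groups(sample):
    print(body)
```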
$str = "(this is inside a bracket (and this is nested or (double nested)))";
do {
    preg_match_all('~\(((?:[^\(\)]++|(?R))*)\)~', $str, $matches);
    echo $str = $matches[1][0] ?? '', "\n";
} while($str);
Outputs this [1]:
> this is inside a bracket (and this is nested or (double nested))
> and this is nested or (double nested)
> double nested
You're right that there is more processing involved (e.g. while loop) but I still don't understand this part
First, the "~" characters aren't really part of the regular expression. As far as I can tell, they are delimiters to mark the start/stop of this. Often you will see "/" used for this purpose.
Next is:
\( ... \)
This matches a pattern that starts with the literal character '(' and ends with ')', where what comes between them matches the elided portion. Since parentheses have special meaning in regex, we need to escape these characters.
Continuing our way inward, we see:
( ... )
Which is non-escaped parentheses. This is a pattern group, and is used to treat the pattern within it as a single unit. For example the pattern "ab*" would match abbb, but not ababab, because the "*" (repeat) modifier only applies to "b". However "(ab)*" matches "ababab", but not "abbbb". In this case, there is no modifier, so these parentheses have no effect on what string matches the overall expression. However, many implementations also use parentheses to define matching groups, which means they will return whatever is captured within the parentheses as a match. Essentially, the pattern of:
\(( ... )\)
means, find a string that starts with '(' and ends with ')', and pull out everything in the middle.
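The grouping and capturing behaviour is easy to check directly. A small Python sketch; the helper name `full` and the sample strings are only for illustration:

```python
import re


def full(pattern, s):
    """True if the whole string matches the pattern."""
    return re.fullmatch(pattern, s) is not None


# Without a group, "*" repeats only the "b"; with a group it repeats "ab" as a unit.
results = {
    ("ab*", "abbb"): full(r"ab*", "abbb"),
    ("ab*", "ababab"): full(r"ab*", "ababab"),
    ("(ab)*", "ababab"): full(r"(ab)*", "ababab"),
    ("(ab)*", "abbbb"): full(r"(ab)*", "abbbb"),
}
print(results)

# Escaped parentheses match literally; the unescaped pair captures what lies between.
captured = re.fullmatch(r"\((.*)\)", "(pull me out)").group(1)
print(captured)
```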
Next comes a similar construct:
(?:...)*
There are 2 things going on here. The (?:...) matches whatever is being elided by ..., however the library does not return it as a separate result. This is used when you need to group things together within a regular expression, but do not want that specific grouping returned as part of the result. The "*" here means that the entire pattern can be matched any number (including 0) of times, and should be matched as many times as possible.
Next is
[^\(\)]
The square brackets indicate that you should match any character within a particular set. The "^" at the beginning of square brackets means that you are inverting the selection, so you will match any character except those specified. The remaining characters are parenthesis literals.
The first "+" indicates that the pattern should match 1 or more of the previous entity. In the case of [^\(\)]+, this would mean that it can match one or more non-parenthesis characters.
The second "+" is different. Since quantifiers are not allowed to follow other quantifiers, the above meaning does not apply, and the language was allowed to overload the symbol. This modifies the previous quantifier to be possessive, meaning it will consume as many characters as possible (e.g. all characters until it hits a parenthesis) and never give any of them back when backtracking. I don't think this is technically needed in this case, but probably improves efficiency.
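A quick way to see the negated class at work, sketched in Python. It sticks to a single "+" because the "++" form only arrived in Python's `re` module in version 3.11; the sample strings are made up:

```python
import re

# Inside square brackets the parentheses need no escaping, so [^()] equals [^\(\)].
run = re.match(r"[^()]+", "aaa(bbb))")
print(run.group())

# "+" needs at least one character, so nothing matches when a parenthesis comes first.
no_run = re.match(r"[^()]+", "(bbb))")
print(no_run)
```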
The next component is "|", which means to match either the pattern on the left, or the right.
The next step is not a regular expression, but one of those "more powerful" additions I mentioned. (?R) is a recursive match, and matches whatever the overall expression matches. E.g., when your expression runs into a nested parenthesis, it recurses and parses the substring as a balanced parenthesis string.
Putting this all together (and ignoring whitespace while adding comments; as most major regex engines have an option to allow you to do):
\(                # Start with an open parenthesis
(                 # This is the beginning of the region I want to extract
  (?:             # Group the following pattern together, but don't save the matching substring
    [^\(\)]++     # Match until a parenthesis character, assuming that would match at least 1 character
    |             # Or
    (?R)          # Match a string with balanced parentheses (assuming that is what the overall regex does)
  )*              # Repeat the preceding pattern as many times as necessary
)                 # End the region I want to extract
\)                # The next character should be a close parenthesis
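The same commented layout is available in Python through the `re.VERBOSE` flag. Since Python's standard `re` module has no (?R), this sketch only handles a flat, non-nested group; the name `FLAT_GROUP` and the sample text are assumptions for illustration:

```python
import re

# re.VERBOSE ignores whitespace in the pattern and treats "#" as a comment start.
FLAT_GROUP = re.compile(
    r"""
    \(          # a literal open parenthesis
    (           # start of the region to extract
      [^()]*    # any run of non-parenthesis characters
    )           # end of the region to extract
    \)          # a literal close parenthesis
    """,
    re.VERBOSE,
)

inner = FLAT_GROUP.search("call(arg one) done").group(1)
print(inner)
```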
Looking at an example of this:
(aaa(bbb))
First, we match "(". Then we try to match (?:[^\(\)]++|(?R))* as a matching group.
This matches [^\(\)]++|(?R) as many times as necessary.
At this point, our remaining string is "aaa(bbb))".
Since the pattern we are matching this against is an "|" pattern, we have 2 options: we can either match against: [^\(\)]++, which would match "aaa", or we could match against (?R), which would fail, since the first character is not '('. As such, we match "aaa". Since this grouping was defined using (?:) instead of (), we do not save "aaa" as a separate result.
Next, since the group is modified by "*", we can either match another instance of it, or move on to match the closing ")". The next character is not ')', so our only option is to match another instance of "[^\(\)]++|(?R)"
At this point the remaining string is (bbb)), so [^\(\)]++ fails to match, since it requires at least one character before the '('. However, now (?R) works and matches (bbb).
Now our remaining string is ")" and our options are again to match either "[^\(\)]++|(?R)", or ')'. At this point, neither [^\(\)]++ nor (?R) work, so the only option is to leave the repetition and match the closing ')'.
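The trace above can be mimicked with an ordinary recursive function, which may make the (?R) step less mysterious. This is a simplified, non-backtracking Python sketch of the same logic, not how PCRE implements it; the function name `match_balanced` is made up:

```python
def match_balanced(s, pos=0):
    """Mimic \\((?:[^()]++|(?R))*\\) at s[pos:]; return the end index or None."""
    if pos >= len(s) or s[pos] != "(":
        return None  # \( must match a literal open parenthesis
    i = pos + 1
    while True:  # the (?: ... )* repetition
        start = i
        while i < len(s) and s[i] not in "()":
            i += 1  # [^()]++ : swallow a run of non-parenthesis characters
        if i > start:
            continue  # first alternative matched, try the group again
        end = match_balanced(s, i)  # (?R) : recurse into the whole pattern
        if end is None:
            break  # neither alternative matched, leave the repetition
        i = end
    if i < len(s) and s[i] == ")":
        return i + 1  # \) closes this level
    return None


print(match_balanced("(aaa(bbb))"))
print(match_balanced("(aaa(bbb)"))
```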
Wow thanks for explaining it to me so wonderfully. Your explanation for the double ++ really helped me since that part never made sense to me before. I guess the ?R probably only works with PHP? I will try to make some more examples for the ?R to try out today so I can learn the full power of it.
Again I'm so grateful to you for the explanation. One more thing I've learned from it is next time a regex makes my head explode, I'll just break each character in one line and write a comment next to it!
I guess I don't understand. Mind throwing up an example with multiple test strings on regex101.com ? I'd like to take a look and see if I can make a regex which does what you want.
So if you could write the examples there, and then a description like you would tell your mom of what you want I'll see what I can do.
Nothing will ever beat RegexBuddy when it comes to Regex tools. It is an entire IDE just for regex, and has been my not-so-secret weapon for a decade or more.
Even that is wrong because you can have privately owned TLDs (I forget what they're technically called) like .google
So sundar.pichai@google is technically a valid address (whether .google has any MX records is another matter)
Regex shouldn't really be used for email addresses anyway because the only reliable way to authenticate an email address is to literally send an email to that address.
TLDs can be managed very differently depending on the owners so I don't think it's safe to assume "applies to all" just because there is a rule in place for some gTLDs.
Is there a bug? In regexp for IPv4: https://ihateregex.io/expr/ip expression ends with {3} but the diagram states "2 times" in lower right - shouldn't it say "3 times"?
I found that you can see your own regex with railroad diagram by going to one of the prepopulated examples and editing it. However, it wasn't clear to me that's the intended use of the tool. It's either a little side-effect, or not super-discoverable.
For the love of god, PLEASE DON’T USE REGEX TO VALIDATE EMAIL. The RegEx of this website ignores plus-addressing, for example. All you need to do to validate email is send a verification email.
not just the email regex is simplified (and at the end plain wrong). Also the one for phone numbers is highly simplified and will not match all valid phone numbers...
I've had to write regex for deeply proprietary SQL-like (the word "like" is a big BIG stretch) language. This really is nothing. The regex itself was 4 pages long. AFAIK they still use it in production, almost 10 years later with 0 modifications.
That is not correct. 15 is total maximum number of repeats including the first one. Even the diagram on https://ihateregex.io/expr/username correctly says that loop can be taken between 2 and 14 times.
Either I'm a regex wizard and don't know it, or perhaps I think I know something but know nothing at all but I've never complained about using regex expressions. I use them all the time without thought. Never quite figured out the need for a cheatsheet either, your language of choice should have a good documentation page for any specific supported syntax.
> To sum up: RegEx's are misnamed. I think it's a shame, but it won't change. Compatible 'RegEx' engines are not allowed to reject non-regular languages. They therefore cannot be implemented correctly with only Finite State Machines. The powerful concepts around computational classes do not apply. Use of RegEx's does not ensure O(n) execution time. The advantages of RegEx's are terse syntax and the implied domain of character recognition. To me, this is a slow moving train wreck, impossible to look away, but with horrible consequences unfolding
Nope. <h1 class="foo>bar">My First Heading</h1> will misparse. (This is valid HTML 5.) You really need recursive regex or something equivalent in power, otherwise you will always fail.
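To see the misparse, here is a Python sketch with a hypothetical naive tag pattern. It is not necessarily the pattern being replied to, just a common first attempt:

```python
import re

# Hypothetical naive pattern: "<", then anything up to the first ">".
NAIVE_TAG = re.compile(r"<[^>]*>")

html = '<h1 class="foo>bar">My First Heading</h1>'
first_tag = NAIVE_TAG.match(html).group()
print(first_tag)
```

The match stops at the ">" inside the attribute value, so the "tag" it reports is cut off mid-attribute.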