
Show HN: Regex Cheatsheet - geongeorgek
https://ihateregex.io/
======
robert_tweed
OK, these kinds of regex tools get posted quite often. I get it, regex is very
confusing at first. And some of these use-cases result in rather complex
expressions nobody should be forced to write from scratch (you are still
remembering to write unit tests for them though, right?)

But as someone who actually knows [some flavours of] regex fairly well, what I
would _really_ like, is a reference that covers all the subtle differences
between the various regex engines, along with community-managed documentation
(perhaps wiki pages) of which applications & API versions use which flavour of
regex.

For example, the other day I wanted to run a find on my NAS. I needed to use a
regex, but the Busybox version of find doesn't support the iregex option, so
all expressions are case-sensitive. With some googling, I was able to find out
that the default regex type is Emacs, but I wasn't able to find either a good
reference for exactly what Emacs regex does and doesn't support, nor any
information about how to set the "i" flag. In the end I had to manually
convert every character into a class (like [aA] for "a") which was tedious,
but quicker than trying to find a better solution or resorting to grep.
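
The manual conversion described above can be automated. A minimal sketch, assuming the pattern contains only literal letters and simple metacharacters (no escapes such as `\d` and no existing bracket classes, which this naive version would corrupt); the helper name `case_insensitive` is made up for illustration:

```python
def case_insensitive(pattern: str) -> str:
    """Replace each ASCII letter with a two-character class, e.g. a -> [aA]."""
    parts = []
    for ch in pattern:
        if ch.isascii() and ch.isalpha():
            parts.append(f"[{ch.lower()}{ch.upper()}]")
        else:
            # Digits, dots, stars, backslashes etc. are copied unchanged.
            parts.append(ch)
    return "".join(parts)


print(case_insensitive(r".*\.jpg"))
```

The result can be pasted into any engine that understands bracket classes, which is why it works even where no case-insensitivity flag is available.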

A related, annoyingly common pattern is that the documentation for `find`
states that `-regex` specifies a regex, but it does not state _which_ flavour
of regex. The documentation for certain versions of `find`, which support
alternative engines, notes that the default is Emacs. From this I was able to
infer (perhaps wrongly) that the Busybox `find` uses Emacs-flavoured regex,
but ultimately I still had to resort to some trial-and-error. This problem is
all too common in API documentation.

~~~
justaj
Honestly, as a noob, this is one of the biggest reasons I have such a hard
time deciding to learn regex.

Python flavor would probably be different than PCRE, which is probably
different than JS flavor.

Even worse is that it might be too late to standardize all the regex flavors
because there is already _so much_ written in different regex flavors that it
just costs too much for them to become obsolete in the future.

This is really demotivating.

~~~
new_guy
> Honestly, as a noob, this is one of the biggest reasons I have such a hard
> time deciding to learn regex.

Clear your afternoon, and just learn it. Seriously, it takes a couple of hours
at most and then - BOOM - you're done for the rest of your life.

~~~
absorber
> you're done for the rest of your life.

If it were that easy then I don't think many of these cheatsheets would exist.

~~~
freehunter
The cheat sheets exist because people aren’t learning regex. You don’t need to
learn every flavor of regex, just the one or small number you need to know.
And once you know the basics, the differences are very minor.

------
crispyambulance
I use regex a lot but deliberately keep it simple.

One thing that confounded me often was positive and negative look-arounds. I
always got the expressions mixed up, until I just put the expressions into a
table like this...

    
    
                  look-behind  |  look-ahead
        ------------------------------------
        positive    (?<=a)b    |    a(?=b)
        ------------------------------------
        negative    (?<!a)b    |    a(?!b)
    
    

It's not hard, but for whatever reason my brain had trouble remembering the
usage because every time I looked it up, each of those expressions was nested
in a paragraph of explanation, and I could not see the simple intuitive
pattern.

Putting it into a simple visualization helps a lot.
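
The four cells of the table can also be tried directly. A small sketch using Python's `re` module, which supports all four forms; the sample string is made up for illustration:

```python
import re

text = "ab cb ac"

# Each entry maps a table cell to the pattern from that cell.
patterns = {
    "positive look-behind": r"(?<=a)b",  # "b" only if preceded by "a"
    "negative look-behind": r"(?<!a)b",  # "b" only if NOT preceded by "a"
    "positive look-ahead": r"a(?=b)",    # "a" only if followed by "b"
    "negative look-ahead": r"a(?!b)",    # "a" only if NOT followed by "b"
}

# Start offsets in `text` where each pattern matches.
offsets = {
    name: [m.start() for m in re.finditer(pattern, text)]
    for name, pattern in patterns.items()
}

for name, found in offsets.items():
    print(f"{name:22} {found}")
```

In every case the match itself is a single character; the look-around only checks the neighbour without consuming it.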

Now, if I can find a similar mnemonic for backreferences !?

~~~
wahern
Maybe it's easier to remember that lookbehinds are evil from an implementation
standpoint, and even in Perl have arbitrary limitations. If you see
lookbehinds, look away! If you see lookaheads, go ahead.

~~~
glangdale
Oddly, lookbehinds are evil only in a specific backtracking world. We never
got around to implementing arbitrary lookarounds in Hyperscan
([https://github.com/intel/hyperscan](https://github.com/intel/hyperscan)) but
if we had done something in the automata world to handle lookaround,
lookbehinds are _way_ easier than lookaheads.

To handle a lookbehind, you really only need to occasionally 'AND' together
some states (not an operation you would normally do in a standard NFA whether
Glushkov or Thompson). To handle lookaheads... well, it gets ugly.

~~~
wahern
Even for non-fixed length expressions?

~~~
glangdale
Variable-length backward asserts are _fine_ in automata-land, which is a bit
peculiar.

------
darau1
Nobody pointed it out, but there's also
[https://regexr.com/](https://regexr.com/)

It's how I learned regex years ago, and I still use it today to test/build
more complex patterns.

~~~
strig
My go-to is [https://regex101.com/](https://regex101.com/)

~~~
darau1
Didn't know about this. Thanks!

~~~
chirss
We use it on slack and irc for debugging people's regular expressions all the
time. Being able to have 30 revisions to a base regex to troubleshoot is
fantastic.

Plus the quiz is awesome.

------
__tk__
I'm loving the graphs, which for the first time in years are giving me an idea
of what an expression is actually doing. That is because the visualization is
kept in a form that is easy to understand with a programming background but
can also be translated back to the expression itself in a straightforward
manner.

~~~
noxToken
Graphs for these really hammer home the point that regular expressions aren't
magic. Parsers have so many abilities that when starting out, my expressions
were horribly inefficient and missed many corner cases. Learning to graph them
just like automata immediately made things easier.

When green devs are having trouble with regular expressions (and don't have a
formal computer science background), I like to give them a crash course in
DFAs.

------
geongeorgek
I used to spend hours trying to craft the perfect expression for my scraping
projects, not realizing that I didn't really know regex.

This tool is a cheat sheet that also explains the commonly used expressions so
that you understand them.

- There is a visual representation of the regular expression (thanks to
regexper)

- The application shows matching strings which you can play around with

- Expressions can be edited and these are instantly validated

------
StavrosK
I love regex and have no trouble reading them, but still love this tool, great
job. I especially like the railroad diagrams, for those cases where I
brainfarted on a regex and it's doing something other than what I intended.
Thanks for this.

~~~
geongeorgek
I'm glad you like the tool <3 It will have a lot more content soon :)

~~~
chirss
If you want some help swing by #regex on efnet, happy to help.

------
lfglopes
I used to use this site [http://txt2re.com](http://txt2re.com) which is now
off the grid, at least since yesterday. :(

Unlike most regex helpers, in this one you would start with the text you want
to filter/parse and then it would suggest possible extractions.

Do you know any alternatives?

~~~
deadliftpro
same, looking for an alternative to txt2re.

------
rubyn00bie
Nice work on this!

Something subtle that I quite loved: the email regex is, IMHO, close to
perfect: \S+@\S+\.\S+

Because the "perfect" one is just absurd, and no one realizes it's going to be
so fucking absurd until they start getting support cases and then go read
something like this:
[https://stackoverflow.com/a/201378/931209](https://stackoverflow.com/a/201378/931209)

> If you want to get fancy and pedantic, implement a complete state engine. A
> regular expression can only act as a rudimentary filter. The problem with
> regular expressions is that telling someone that their perfectly valid
> e-mail address is invalid (a false positive) because your regular expression
> can't handle it is just rude and impolite from the user's perspective.

~~~
p4lindromica
Even this regexp has false positives.

The `ai` ccTLD ran their own mail server at the root, so an address like
`a@ai` was a valid email address.

They serve a website at the tld root: [http://ai./](http://ai./)
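
A quick way to see what the loose pattern accepts and rejects. This is only a sketch: it assumes the whole input must match (hence `fullmatch`), which may not be how the site applies the expression, and the sample addresses are made up except for `a@ai` from the comment above:

```python
import re

LOOSE_EMAIL = re.compile(r"\S+@\S+\.\S+")


def looks_like_email(candidate: str) -> bool:
    """True if the whole string matches the loose pattern."""
    return LOOSE_EMAIL.fullmatch(candidate) is not None


for candidate in ["user@example.com", "user+tag@example.com", "a@ai", "a@@b.c"]:
    print(candidate, looks_like_email(candidate))
```

The dotless `a@ai` is rejected because the pattern insists on a literal dot after the `@`, while `a@@b.c` is accepted because `\S` also matches `@`.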

------
philshem
I have a secret hobby of answering python + regex questions on stackoverflow
with pure python.

~~~
johnnylambada
Examples?

~~~
philshem
_secret_

------
vzidex
Very cool! The site that worked best for me to learn regex was
[https://regexcrossword.com/](https://regexcrossword.com/) - after solving my
way through all of them (I got really hooked when I discovered the site) I
found I was alright at regex.

~~~
geongeorgek
Thank you for sharing that. looks good

------
adambowles
>/h.llo/ the '.' matches any one character other than a new line character...
matches 'hello', 'hallo' but not 'h llo'

in the cheatsheet is false.
([https://regexr.com/4tc48](https://regexr.com/4tc48))

`.` can match any character except linebreaks (including whitespace)

~~~
jodrellblank
`.` "can" match any character including linebreaks if the regex engine is in
re.DOTALL mode (Python) or SingleLine Mode (.Net).
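
Both points can be checked in a few lines of Python, using the `h.llo` example from the cheatsheet:

```python
import re

# "." matches a space: whitespace other than a newline is fine.
space_ok = re.fullmatch(r"h.llo", "h llo") is not None

# By default "." does not match a newline...
newline_default = re.fullmatch(r"h.llo", "h\nllo") is not None

# ...but it does once re.DOTALL is set.
newline_dotall = re.fullmatch(r"h.llo", "h\nllo", re.DOTALL) is not None

print(space_ok, newline_default, newline_dotall)
```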

------
dana321
One thing i've always missed from the Perl programming language is the regex
operators.

You could do:

    
    
      my $var='foo foo bar and more bar foo!!!';
      if($var=~/(foo|bar)/g){  # does the variable contain foo or bar?
        print "foo! $1 removing foo..\n";
        # remove our value..
        $var=~s/$1//g;
      }

~~~
radiac
So did I: [https://github.com/radiac/python-perl/](https://github.com/radiac/python-perl/)

~~~
dana321
Awesome job, i did a hack bootstrapping the tokenizer to do the same thing in
php, didn't release it though.

------
asicsp
neat site! clicking an example opens up a playground with live update and
explanation and railroad diagrams, similar to sites like regex101[1] and
regulex[2]

one suggestion would be to mention clearly which tool/language is being used,
since regex has no unified standard. based on the "Cheatsheet adapted" message
at the bottom, I think it is for JavaScript. I wrote a book on js regexp last
year, and I have a post with a cheatsheet too [3]

[1] [https://regex101.com/](https://regex101.com/)

[2] [https://jex.im/regulex](https://jex.im/regulex)

[3]
[https://learnbyexample.github.io/cheatsheet/javascript/javas...](https://learnbyexample.github.io/cheatsheet/javascript/javascript-regexp-cheatsheet/)

~~~
geongeorgek
Totally agreed! Right now I only support javascript. But for everything shown
there, it's pretty much the same for most flavors

------
Glench
Plug for Verbal Expressions (no affiliation), which has an alternate way of
compiling more human-readable regexes for a dozen languages:
[http://verbalexpressions.github.io/](http://verbalexpressions.github.io/)

~~~
linusjs_
I remember that library. A year after I made regexpbuilder
[https://www.npmjs.com/package/regexpbuilder](https://www.npmjs.com/package/regexpbuilder)
that library suddenly appeared, and was basically a rip-off of the concept I
appear to have created (there was no such other library before regexpbuilder),
but is also fairly useless because it doesn't look like it could represent
more than about 10% of the possible regular expressions. Yet there was no
mention of my library at all in the readme of verbal expressions.

------
mimixco
This is awesome! Thank you! I hate regex, too, but I love your inline railroad
diagramming tool.

~~~
geongeorgek
Haha thank you <3

------
superasn
Regex are quite simple and useful but my only issue is with those recursive
things. Like how do you match balanced brackets? I have a regex (pcre) copy-
pasted for it but for the life of me I don't get it or maybe nod my head but
instantly ununderstand it. I wish there was a simple-to-understand doc that
teaches me how I can match something like:

    
    
        "(this is inside a bracket (and this is nested or (double nested)))"
    

P.S. I know token parsing is better for these things but still I just want to
learn the other thing too.

~~~
gizmo686
Balanced parentheses are not a regular language, so it's theoretically
impossible to match them with regular expressions.

In practice, most regexp implementations you see are more powerful than regular
expressions. For instance, .NET has a balancing groups feature [0] for exactly
this use case.

[0] [https://regular-expressions.mobi/balancing.html?wlr=1](https://regular-expressions.mobi/balancing.html?wlr=1)

~~~
superasn
The regex I've copy-pasted is this:

    
    
        $str = "(this is inside a bracket (and this is nested or (double nested)))";
        do {
            preg_match_all('~\(((?:[^\(\)]++|(?R))*)\)~', $str, $matches);
            echo $str = $matches[1][0] ?? '', "\n";
        } while($str);
    

Outputs this [1]:

    
    
        > this is inside a bracket (and this is nested or (double nested))
        > and this is nested or (double nested)
        > double nested
    
    

You're right that there is more processing involved (e.g. while loop) but I
still don't understand this part

    
    
        '~\(((?:[^\(\)]++|(?R))*)\)~'
    

[1] [https://rextester.com/MEH86820](https://rextester.com/MEH86820)

~~~
gizmo686
A couple of things going on here.

First, the "~" characters aren't really part of the regular expression. As far
as I can tell, they are delimiters that mark the start and end of the pattern.
Often you will see "/" used for this purpose.

Next is:

    
    
      \( ... \)
    

This matches a pattern that starts with the literal character '(' and ends
with ')', where what comes between them matches the elided portion. Since
parentheses have special meaning in regex, we need to escape these characters.

Continuing our way inward, we see:

    
    
        ( ... )
    

Which is non-escaped parentheses. This is a pattern group, and is used to
treat the pattern within it as a single unit. For example the pattern "ab*"
would match abbb, but not ababab, because the "*" (repeat) modifier only
applies to "b". However "(ab)*" matches "ababab", but not "abbbb". In this
case, there is no modifier, so these parentheses have no effect on what string
matches the overall expression. However, many implementations also use
parentheses to define matching groups, which means they will return whatever
is captured within the parentheses as a match. Essentially, the pattern of:

    
    
      \(( ... )\) 
    

means, find a string that starts with '(' and ends with ')', and pull out
everything in the middle.
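
The grouping and capturing behaviour described above can be checked quickly. A small sketch in Python (the thread's pattern is PCRE, but these basics behave the same way in Python's `re`); the sample strings are the ones from the explanation plus one made-up sentence:

```python
import re

# Without a group, "*" repeats only the "b".
star_on_b = (
    re.fullmatch(r"ab*", "abbb") is not None,
    re.fullmatch(r"ab*", "ababab") is not None,
)

# With a group, "*" repeats the whole "ab" unit.
star_on_group = (
    re.fullmatch(r"(ab)*", "ababab") is not None,
    re.fullmatch(r"(ab)*", "abbbb") is not None,
)

# Escaped parentheses match literally; the unescaped pair captures the middle.
captured = re.search(r"\((.*)\)", "before (the middle) after").group(1)

print(star_on_b, star_on_group, captured)
```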

Next comes a similar construct:

    
    
      (?:...)
    

There are 2 things going on here. This matches whatever is being elided by
..., however the library does not return it as a separate result. This is used
when you need to group things together within a regular expression, but do not
want that specific grouping returned as part of the result. The "*" that
follows the group means that the entire pattern can be matched any number
(including 0) of times, and should be matched as many times as possible.

Next is

    
    
       [^\(\)]
    

The square brackets indicate that you should match any character within a
particular set. The "^" at the beginning of the square brackets means that you
are inverting the selection, so you will match any character except those
specified. The remaining characters are parenthesis literals.

The first "+" indicates that the pattern should match 1 or more of the previous
entity. In the case of [^\(\)]+, this would mean that it can match one or
more non-parenthesis characters.

The second "+" is different. Since quantifiers are not allowed to follow other
quantifiers, the above meaning does not apply, and the language was allowed to
overload the symbol. This modifies the previous quantifier to be possessive:
it still consumes as many characters as possible (e.g. all characters until it
hits a parenthesis), but it never gives any of them back when the engine
backtracks. I don't think this is technically needed in this case, but it
probably improves efficiency.

The next component is "|", which means to match either the pattern on the
left, or the right.

The next step is not a regular expression, but one of those "more powerful"
additions I mentioned. (?R) is a recursive match, and matches whatever the
overall expression matches. E.g., when your expression runs into a nested
parenthesis, it recurses and parses the substring as a balanced-parentheses
string.

Putting this all together (and ignoring whitespace while adding comments; as
most major regex engines have an option to allow you to do):

    
    
        \( #Start with an open parenthesis
        (  #This is the beginning of the region I want to extract
          (?: #Group the following pattern together, but don't save the matching substring
            [^\(\)]++ # Match until a parenthesis character, assuming that would match at least 1 character
            | # Or
            (?R) #Match a string with balanced parentheses (assuming that is what the overall regex does).
          )* #Repeat the preceding pattern as many times as necessary
        ) #End the region I want to extract
        \) #The next character should be a close parenthesis.
    

Looking at an example of this:

    
    
      (aaa(bbb))
    

First, we match "(". Then we try to match (?:[^\(\)]++|(?R))* as a matching
group.

This matches [^\(\)]++|(?R) as many times as necessary.

At this point, our remaining string is "aaa(bbb))".

Since the pattern we are matching this against is an "|" pattern, we have 2
options: we can either match against [^\(\)]++, which would match "aaa", or
we could match against (?R), which would fail, since the first character is
not '('. As such, we match "aaa". Since this grouping was defined using (?:)
instead of (), we do not save "aaa" as a separate result.

Next, since the group is modified by "*", we can either match another instance
of it, or move on to match the closing ")". The next character is not ')', so
our only option is to match another instance of "[^\(\)]++|(?R)".

At this point the remaining string is "(bbb))", so [^\(\)]++ fails to match,
since it requires at least one character before the '('. However, now (?R)
works and matches "(bbb)".

Now our remaining string is ")" and our options are again to match either
"[^\(\)]++|(?R)", or ')'. At this point, neither [^\(\)]++ nor (?R) work,
so the only option is to leave the repetition and match the closing ')'.
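
For comparison, here is the "token parsing" route mentioned earlier in the thread. Python's built-in `re` module has no `(?R)`, so this is not the recursive regex itself but a swapped-in technique: a plain stack-based scan. It is a sketch only; the function name `nested_contents` is made up, and unlike the PHP loop it reports every balanced group rather than only the first one at each level:

```python
from typing import List


def nested_contents(text: str) -> List[str]:
    """Return the inside of every balanced (...) group, outermost first."""
    open_positions = []  # indices of "(" that have not been closed yet
    found = []           # (start index, content) for each closed group
    for i, ch in enumerate(text):
        if ch == "(":
            open_positions.append(i)
        elif ch == ")" and open_positions:
            start = open_positions.pop()
            found.append((start, text[start + 1 : i]))
    # Groups close innermost first; sorting by start index puts the
    # outermost group at the front.
    return [content for _, content in sorted(found)]


sample = "(this is inside a bracket (and this is nested or (double nested)))"
for content in nested_contents(sample):
    print(content)
```

Pushing on "(" and popping on ")" plays the same role as the regex engine entering and leaving a `(?R)` recursion.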

~~~
superasn
Wow thanks for explaining it to me so wonderfully. Your explanation for the
double ++ really helped me since that part never made sense to me before. I
guess the ?R probably only works with PHP? I will try to make some more
examples for the ?R to try out today so I can learn the full power of it.

Again I'm so grateful to you for the explanation. One more thing I've learned
from it is next time a regex makes my head explode, I'll just break each
character in one line and write a comment next to it!

------
xxsaculxx
Nice tool! I personally use [https://regex101.com/](https://regex101.com/) as
I like the explanations and quick reference.

------
sylvanaar
Nothing will ever beat RegexBuddy when it comes to Regex tools. It is an
entire IDE just for regex, and has been my not-so-secret weapon for a decade
or more.

------
kitd
This is really cool!

2 points:

1. it fiddled with my back button which is a bit annoying

2. a better email sample is

    
    
        ^[^@]+@[^@]+\.[^@]+$ 
    

which removes the problem of two @ signs.
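
A short check of what this stricter pattern does, sketched in Python with made-up addresses:

```python
import re

ONE_AT = re.compile(r"^[^@]+@[^@]+\.[^@]+$")


def has_single_at(candidate: str) -> bool:
    """True if the string matches the one-@-then-a-dot pattern."""
    return ONE_AT.match(candidate) is not None


for candidate in ["a@b.c", "a@@b.c", "a@b@c.d", "user@localhost", "a b@c.d"]:
    print(repr(candidate), has_single_at(candidate))
```

A second `@` is now rejected, but a dot is still required after the `@`, and because `[^@]` also matches whitespace, a string such as `a b@c.d` is accepted.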

~~~
laumars
Even that is wrong because you can have privately owned TLDs (I forget what
they're technically called) like .google

So sundar.pichai@google is technically a valid address (whether .google has
any MX records is another matter)

Regex shouldn't really be used for email addresses anyway because the only
reliable way to authenticate an email address is to literally send an email to
that address.

~~~
bduerst
AFAIK none of the TLDs allow for MX records on just the TLD

i.e. johndoe@com will never exist

~~~
laumars
I'm not on about gTLDs like .com, I'm on about the privately owned ones like
.amazon and .google

~~~
bduerst
I didn't say gTLDs. The _id est_ applies to all.

~~~
laumars
TLDs can be managed very differently depending on the owners so I don't think
it's safe to assume "applies to all" just because there is a rule in place for
some gTLDs.

------
dan_hawkins
Is there a bug? In regexp for IPv4:
[https://ihateregex.io/expr/ip](https://ihateregex.io/expr/ip) expression ends
with {3} but the diagram states "2 times" in lower right - shouldn't it say "3
times"?

~~~
jve
I think it says "repeat 2" times. So basically you've already gone through the
group once and then go through it 2 more times.

Because if I specify x{0,3}, I have 2 paths: around x, or through x plus at
most 2 more times

~~~
geongeorgek
Yep you are right

------
KenanSulayman
I don't understand why the Github repository lists regexper as the source of
the visual graph code but the frame only shows iHateRegex as watermark?

If the only thing that is embedded in that frame was taken entirely from a
different project, that project should at least be mentioned in the frame.

------
hyperpape
Really nice idea.

I found that you can see your own regex with railroad diagram by going to one
of the prepopulated examples and editing it. However, it wasn't clear to me
that's the intended use of the tool. It's either a little side-effect, or not
super-discoverable.

------
mNovak
I always refer back to [http://rexegg.com/](http://rexegg.com/) Not a tool as
such, but a good reference if you know how it works and just need to refresh
on syntax.

------
kazinator
There is no way I would just plop that IPv6 regex into any serious program. :)

------
Diti
For the love of god, PLEASE DON’T USE REGEX TO VALIDATE EMAIL. The RegEx of
this website ignores plus-addressing, for example. All you need to do to
validate email is send a verification email.

~~~
xiconfjs
not just the email regex is simplified (and in the end plain wrong). Also the
one for phone numbers is highly simplified and will not match all valid phone
numbers...

------
axegon
This is awesome but.... I don't hate regex. Matter of fact, I love regex.

~~~
geongeorgek
check out the ipv6 one :)

~~~
axegon
I've had to write regex for deeply proprietary SQL-like (the word "like" is a
big BIG stretch) language. This really is nothing. The regex itself was 4
pages long. AFAIK they still use it in production, almost 10 years later with
0 modifications.

¯\_(ツ)_/¯

------
Amarok
^[a-z0-9_-]{3,15}$

The username reference doesn't match 16 characters as claimed

~~~
geongeorgek
It should match. the number 15 there means that repeat x up to 15 times. so
1+15=16.

looks good to me

~~~
aratauto
That is not correct. 15 is total maximum number of repeats including the first
one. Even the diagram on
[https://ihateregex.io/expr/username](https://ihateregex.io/expr/username)
correctly says that loop can be taken between 2 and 14 times.
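
The counting is easy to confirm by testing strings just inside and just outside the bounds. A sketch in Python, using the username pattern from the top of this sub-thread:

```python
import re

USERNAME = re.compile(r"^[a-z0-9_-]{3,15}$")

# Map each candidate length to whether a username of that length matches.
accepted = {n: USERNAME.match("a" * n) is not None for n in (2, 3, 15, 16)}

for length, ok in accepted.items():
    print(length, ok)
```

`{3,15}` counts total repetitions, so lengths 3 through 15 match and 16 does not.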

------
chenster
For email specific regular expression, it's all covered on
[https://emailregex.com](https://emailregex.com)

------
binarysneaker
These regexs are garbage. Others have suggested better sites for learning how
to construct regexs, and stackoverflow has plenty of great examples.

~~~
geongeorgek
Why don't you link them with the comment

------
olalonde
Thumbs up for the relatable domain name.

~~~
geongeorgek
Glad you find it that way

------
esaym
Either I'm a regex wizard and don't know it, or perhaps I think I know
something but know nothing at all but I've never complained about using regex
expressions. I use them all the time without thought. Never quite figured out
the need for a cheatsheet either, your language of choice should have a good
documentation page for any specific supported syntax.

------
hamid_ra
love the idea! I would crowdsource it so people can add their regex and vote
on other people's regexes!

------
ape4
The IPv6 regex is surprisingly complicated.

~~~
geongeorgek
Yeah. this is when you start to have 2 problems

------
samat
This is very neat, thank you!

------
blauditore
Would be nice to have a regex for parsing HTML...

 _grabs popcorn_

~~~
bmn__
Easy with a sufficiently powerful engine:
[https://stackoverflow.com/a/4234491](https://stackoverflow.com/a/4234491)

Relies on ?(DEFINE):
[http://p3rl.org/perlre#(DEFINE)](http://p3rl.org/perlre#\(DEFINE\))

~~~
quickthrower2
There is a good comment on that answer:

> To sum up: RegEx's are misnamed. I think it's a shame, but it won't change.
> Compatible 'RegEx' engines are not allowed to reject non-regular languages.
> They therefore cannot be implemented correctly with only Finte State
> Machines. The powerful concepts around computational classes do not apply.
> Use of RegEx's does not ensure O(n) execution time. The advantages of
> RegEx's are terse syntax and the implied domain of character recognition. To
> me, this is a slow moving train wreck, impossible to look away, but with
> horrible consequences unfolding

------
shawnyou
Good tool

