
Regex for Noobs – An Illustrated Guide - Rainymood
https://www.janmeppe.com/blog/regex-for-noobs/
======
Ocerge
Regex is one of those topics I just can't seem to commit to memory. The
concept is easy enough, and every time I need one I can usually work out what
to do after a short amount of research, but I feel like I forget immediately
afterwards. It may have to do with how regexes of any complexity look like
machine code to me.

~~~
dayvid
Regexone ([https://regexone.com/](https://regexone.com/)) is one of the best
online tutorials I’ve come across for anything. I got a decent grasp of Regex
after going through the series 1-2 times.

~~~
mettamage
Reminds me of websites like
[https://www.sqlteaching.com/](https://www.sqlteaching.com/) It's obviously
not for regexes but after that I understood basic SQL so much better.

(also I love comments that point out amazing learning resources with full
conviction)

~~~
jnbiche
The creator of regexone.com also created sqlbolt.com for SQL, which I also
highly recommend.

------
nineteen999
One thing I still like about Perl is how well the regex syntax is integrated
into the language. Combined with being installed on just about every modern
Linux system, that puts it right at your fingertips.

If you're in a hurry and just need to do some one-off slicing and dicing,
there's not much more powerful than being able to say, eg:

    
    
      $data = "foobarfoo";
      $data =~ s/bar/baz/g;
      print "$data\n";
    

versus python's more dramatic and formal:

    
    
      import re
      data = "foobarfoo"
      data = re.sub('bar', 'baz', data)
      print(data)
    

For some reason, the Perl =~ notation sticks in my head much more easily than
the Python method.

Of course you can string a bunch of sed commands together in a shell script as
well, but I find that becomes unwieldy pretty quickly in a lot of situations.

~~~
megaframe
Agreed.

In other languages a part of me always feels like if I'm using regex I'm doing
something wrong, because I used it so liberally in Perl. It feels like a Perl-
ism so I try to make a conscious effort not to carry those over as they tend
to be the source of inefficiencies.

~~~
masklinn
TBF for this specific case it's very much unnecessary to break out regexes:
most languages do have some sort of literal replacement string API:

    
    
        str.replace = replace(...)
            S.replace(old, new[, count]) -> str
        
            Return a copy of S with all occurrences of substring
            old replaced by new.  If the optional argument count is
            given, only the first count occurrences are replaced.
    

(though the documentation turns out more misleading than re.sub's: this states
it replaces all occurrences but really replaces all _non-overlapping_
occurrences, which re.sub actually spells out properly).
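A quick way to see the non-overlapping behaviour, as a minimal sketch (the
sample string is made up for illustration):

```python
import re

# "aaa" contains two overlapping occurrences of "aa" (at index 0 and index 1),
# but once the first one is consumed only a single "a" is left over.
text = "aaa"

literal_result = text.replace("aa", "b")
regex_result = re.sub("aa", "b", text)

print(literal_result)  # ba
print(regex_result)    # ba
```

Both APIs scan left to right and never revisit characters that were already
part of a match.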

~~~
megaframe
Oddly enough, I was once asked to determine whether there were any differences
between Perl's regex and Python's implementation, in an effort to convert a
tool I wrote into Python (so other developers could help).

I found only one very weird edge case worth noting. In Perl, the regex
"s/a*/x/g" turns the string "bac" into "xbxxcx": "a*" technically means zero
or more, so it also matches the empty positions between the characters. Not so
in Python, which creates the string "xbxcx" because it matches the "a" once
and doesn't count the empty position right after it. Slightly less accurate
results, since "a*" does mean zero or more, so the empty position counts as
zero characters.

~~~
asicsp
That was changed recently, I get 'xbxcx' on 3.5 and 'xbxxcx' on 3.7

And coming to Perl vs Python regex differences, there are too many to count.
Python's 3rd party module 'regex' is more comparable to Perl regex. For
example, Python doesn't support possessive quantifiers, subexpression calls,
\K, \G and so on.
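The edge case is easy to reproduce. A minimal sketch, assuming a Python 3.7 or
later interpreter (older versions give the other result mentioned above):

```python
import re
import sys

# Same substitution as the Perl s/a*/x/g example, applied to "bac".
result = re.sub(r"a*", "x", "bac")

# On Python 3.7 and later the empty match right after the "a" is also
# replaced, which gives the same answer as Perl: xbxxcx
print(sys.version_info[:2], result)
```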

------
sandgraham
[https://regexone.com/](https://regexone.com/) is a great resource for
learning regex. Also try [https://regex101.com/](https://regex101.com/) as a
sandbox for experimenting and expanding your regex skills.

~~~
mettamage
Why not use sublime as your regex editor? What makes regex101 special?

~~~
m4tthumphrey
Can't speak for Sublime but 101 has inline highlighting and explanations per
regex feature/flag/pattern. It really is awesome.

------
obtino
The trouble with regular expressions is not their use, but the different
"standards" (or lack thereof) and figuring out what the interpreter supports.

~~~
Biganon
Yes, two issues are a) knowing what to escape and b) what is supported or not.
I was weirdly surprised to discover that there are some useful functionalities
that are basically supported nowhere... except in vim. I think it was
variable-length lookbehinds but don't quote me on this.

~~~
ygra
Lookbehind is restricted in many regex engines that support it (many don't at
all). Some require the lookbehind to be constant-length (so if you have
alternatives in there, they all have to have the same length, and you can't
use quantifiers, basically). Some require it to be
finite-but-known-ahead-of-time-length, so something like (?<=a{3,6}) is okay,
but (?<=a{3,}) is not. Also (?<=a|bb) would be okay.

.NET's System.Text.RegularExpressions.Regex is one implementation that has no
restrictions on lookbehind. Having used PowerShell for so long it now happens
sometimes that I forget when writing regexes for other implementations, as
it's really convenient at times.
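As one data point, Python's built-in re module sits in the constant-length
camp. A small sketch that probes which lookbehinds it accepts (the patterns
are made up for illustration, and `compiles` is a hypothetical helper):

```python
import re

def compiles(pattern):
    """Return True if Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

candidates = [
    r"(?<=aaa)b",     # constant length
    r"(?<=ab|cd)e",   # alternatives of equal length
    r"(?<=a|bb)c",    # alternatives of different lengths
    r"(?<=a{3,6})b",  # bounded quantifier
    r"(?<=a{3,})b",   # unbounded quantifier
]

accepted = {pattern: compiles(pattern) for pattern in candidates}
for pattern, ok in accepted.items():
    print(pattern, ok)
```

Only the first two compile; the other three are rejected because the
lookbehind does not have a fixed width.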

------
thrwway3473
Nice resource. You do start off making it a bit daunting:

>For most people without a formal CS education, regular expressions (regex)
can come off as something that only the most hardcore unix programmers would
dare to touch.

my experience wasn't as daunting as this at all.

this is my story: basically every regex I ever wrote always worked the first
time and I found it super easy. my intro was the Perl 5 "camel" book, i.e.
Programming Perl. never heard of regexes before that.

If you find this current tutorial we're discussing tricky or daunting, maybe
give the resource I just mentioned ("Programming Perl" for Perl 5) a go
because for whatever reason the explanation was super simple and writing a
regex was one of the easiest things I ever did no matter what I wanted to do.
I didn't know about the concept itself before I read that book. I've used it
to parse loads and loads of things, use it in my editors, etc. It's always so
easy.

I don't know if you can find the "Programming Perl" book, I tried to look and
found something like that here:
[https://www.cs.ait.ac.th/~on/O/oreilly/perl/prog/ch01_07.htm](https://www.cs.ait.ac.th/~on/O/oreilly/perl/prog/ch01_07.htm)

you can maybe ignore the stuff about Perl and still sort of follow the
tutorial.

just an alternative in case someone had their curiosity piqued by this
illustrated guide, but it doesn't quite "click" for them, and they want to see
the same thing written out in another teaching style. And a note that for me
it was never daunting at all!

~~~
Ciberth
By any chance studied in Belgium, Ghent? Just curious for educational purpose
:)

~~~
thrwway3473
>By any chance studied in Belgium, Ghent?

No, I didn't. (at the time, bought the Perl book in the U.S., where I read
it.)

------
manjana
Regular expressions may appear intimidating to the complete newbie, but once
you grasp a few simple ideas I'd argue it is one of the easier topics in CS.

~~~
Biganon
And incredibly useful. I don't know if that's just me and the things I do with
a computer, but I feel like it's knowledge I've used a million times in my
life, both for my pet projects and my job as a law school TA. What baffles me
is that it's niche knowledge. Excel is insanely useful, but it's considered
basic knowledge, or at least everyone has heard of it. Not regex.

~~~
gmac
Agreed. I teach a session on it every year to Economics PhDs and faculty — if
I can only get them to come. It's something they just don't know they don't
know.

[http://mackerron.com/text/text-slides.pdf#page19](http://mackerron.com/text/text-slides.pdf#page19)

~~~
jacobevelyn
Wow, I just wanted to say thanks for sharing this. These slides are great.
Would you mind if I shared this link with some coworkers?

~~~
gmac
Share away!

------
chupa-chups
A small collection of regex gotchas:
[https://www.rexegg.com/regex-gotchas.html](https://www.rexegg.com/regex-gotchas.html)

------
asicsp
Shameless plug: I've written books [1] which focus entirely on the specific
flavor supported by the tool or the programming language. I use plenty of
examples to present the features one by one. These can be used for learning
purposes as well as a handy reference guide.

I think of regular expressions as a mini-programming language, and use loose
fitting analogies

    
    
        Anchors --> adding if conditions
        Alternation --> Conditional OR
        a(b|c)d = abd|acd --> a(b+c)d = abd+acd
        Quantifiers --> string repetition and range operators
        Dot metacharacter + quantifiers + alternation --> Conditional AND
        Capture group and backreference --> variable
        Subexpression call --> function
        Character class --> sets
        Lookarounds --> custom conditionals
        Flags --> like command line options
    

So far, I've written books for Ruby and Python regular expressions, and
BRE/ERE/PCRE/Rust for "GNU grep and ripgrep". Currently writing a book for GNU
sed.

[1]
[https://learnbyexample.github.io/books/](https://learnbyexample.github.io/books/)

------
theclaw
I consider myself to be fairly well versed in regex having used it for years,
but a recent task to match only stored procedure definitions including the
optional block comments immediately preceding them (if any) in large SQL
scripts proved completely exasperating. I was deep in the weeds trying to get
it to do optional positive look-behind assertions before I just gave up and
wrote some code to do it.

The gory details of how to just match block comments are best outlined by this
blog post detailing one kindred spirit's journey through the wilderness:
[https://blog.ostermiller.org/find-comment](https://blog.ostermiller.org/find-comment).
Although this wasn't particularly helpful to my situation, it was reassuring
to know that I was not alone, and it might prove interesting to the crowd
here.

------
anon1253
Regex named capture groups are a somewhat underused feature, I find.
Especially with things like simple parsers, they help to keep stuff readable:
[https://regular-expressions.mobi/named.html](https://regular-expressions.mobi/named.html)
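A minimal sketch of named groups in Python, where the syntax is
`(?P<name>...)`. The "key = value" line format and the group names are
hypothetical, chosen only for illustration:

```python
import re

# Hypothetical "key = value" config line; the group names are our own choice.
LINE = re.compile(r"(?P<key>\w+)\s*=\s*(?P<value>.*)")

match = LINE.fullmatch("timeout = 30")
if match:
    print(match.group("key"))  # timeout
    print(match.groupdict())   # {'key': 'timeout', 'value': '30'}
```

Reading `match.group("key")` says more about intent than `match.group(1)`,
and the names survive if another group is later added in front.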

~~~
pvorb
The same is true for free-spacing mode and comments, which can improve the
readability of regular expressions a lot:
[https://www.regular-expressions.info/freespacing.html](https://www.regular-expressions.info/freespacing.html)

------
danielparks
Has anybody seen an introduction to regexes that starts with finite automata?

I was introduced to them that way and it seemed very intuitive to me. I’m not
sure if I just have a head for regular expressions, or if the way they were
presented was especially good.

~~~
tsumnia
NC State's second semester CS course uses a regex model as an example of
Finite State Machines; however, I think it's only an example, not a deep dive
into it.

------
bane
I've taught a few "intro to regex" classes and have settled on a very simple
explanation that seems to work even with fairly non-technical audiences. In
short:

1) Regular expressions and "regex" are different things. Regex is a superset
of regular expressions. But if you can understand regular expressions you can
understand a huge number of regexes and have the tools to understand the rest
of the syntax that the system uses.

2) There are many many many regular expression and regex dialects and engines.
It's useful to learn the dialect you are trying to use.

3) Originally, regular expressions were used to describe a generator of
strings. It can be helpful to approach writing regexes from this standpoint.
It's not about what it matches, but what strings the regex can produce. I've
had many students go from bewildered incomprehension to immediate eureka when
I describe this.

4) There are only three rules that the beginning regex user has to know:

i) Concatenation - you can make a bigger generator by combining one generator
after the other. For example: the regex 'a' and the regex 'b' can be
concatenated into 'ab' and produces only the string 'ab'.

ii) Alternation - you can specify one character or another using the pipe
character in most dialects '|'. For example the regex 'a|b' produces either
the string 'a' or 'b'.

iii) Repetition - the character '*' is a postfix operator that means that
the character that precedes it can be repeated zero to an infinite number of
times. For example: 'a*' produces the empty string, 'a', 'aa', 'aaa' and
so on.

5) By combining these (i.e. using concatenation) and the grouping operators
'(' ')' you can write a regex for almost anything.

6) Nearly all of the other characters you see in regexes that don't make sense
are simply syntactic sugar that makes it easier to write common expressions in
a more compact form. Examples:

Writing lots of alternations of single characters (called character classes):

- 'a|b|c|d|e|f' can be written as '[a-f]'
- an expression that is the alternation of every character except the newline
  character can be written as '.'

There's a whole host of these kinds of things.

Writing lots of different kinds of repetitions introduces lots of new
operators:

- 'aa*' can be written as 'a+', where '+' means 'repeat the previous
  character 1 or more times'
- 'a?' means 'repeat the previous character 0 or 1 times'
- 'a{1,5}' means 'repeat the previous character 1 to 5 times'

There's also an entire table of these things.
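The "syntactic sugar" claim in point 6 can be spot-checked. A minimal sketch
in Python (the sample strings and the `same_language` helper are made up for
illustration, and a finite sample is evidence rather than proof):

```python
import re

# Each pair is (shorthand, spelled-out form using only |, * and concatenation).
PAIRS = [
    (r"[a-f]", r"a|b|c|d|e|f"),
    (r"a+", r"aa*"),
    (r"a?", r"a|"),
    (r"a{1,3}", r"a|aa|aaa"),
]
SAMPLES = ["", "a", "aa", "aaa", "aaaa", "b", "f", "g", "ab"]

def same_language(shorthand, spelled_out, samples):
    """True if both patterns accept exactly the same strings among the samples."""
    return all(
        bool(re.fullmatch(shorthand, s)) == bool(re.fullmatch(spelled_out, s))
        for s in samples
    )

agreement = [same_language(short, full, SAMPLES) for short, full in PAIRS]
for (short, full), ok in zip(PAIRS, agreement):
    print(short, "vs", full, ok)
```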

After that it's mostly practice and referring to the description of the
operators for the dialect you are using.
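The "generator of strings" view from points 3 to 5 can be sketched in a few
lines of Python. Everything here is a toy for illustration: the function names
are invented, and `repeat` is a bounded stand-in for '*' because the real
operator produces infinitely many strings.

```python
import re

def literal(ch):
    """Generator for a single character."""
    return {ch}

def concat(left, right):
    """Concatenation: every string from left followed by every string from right."""
    return {x + y for x in left for y in right}

def alternate(left, right):
    """Alternation: anything either side can produce."""
    return left | right

def repeat(inner, max_reps):
    """Bounded stand-in for '*': zero up to max_reps repetitions of inner."""
    produced = {""}
    current = {""}
    for _ in range(max_reps):
        current = concat(current, inner)
        produced |= current
    return produced

# Strings produced by a(b|c)* with at most two repetitions.
strings = concat(literal("a"), repeat(alternate(literal("b"), literal("c")), 2))
print(sorted(strings))  # ['a', 'ab', 'abb', 'abc', 'ac', 'acb', 'acc']

# Every produced string is accepted by the corresponding regex.
all_match = all(re.fullmatch(r"a(b|c)*", s) for s in strings)
print(all_match)  # True
```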

I usually cover capture groupings (e.g. \1, \2, \3 or $1, $2, $3) later on if
necessary, as it requires people to have a good handle on the particulars of
their language's way of handling those things. At this time I'll also cover
the '(?:.)' non-capture group operator if their dialect allows.

About as frequently I'll have to introduce '^' and '$' as begin and end
string operators and also deconflict '^' from its use in '[^.]'.

All the crazy-ass back/forward reference stuff I usually leave out as people
almost never need or come across those except in pretty rare instances. In
those cases they usually have enough regex under their belt that it's just
another topic to pick up.

This doesn't get anybody to pro-wizard level and there's always edge cases and
weird Perl-golf stuff that's out there, but it'll get most people to generally
functional inside of a day or two and I'll just make myself available to spot
answer questions or remind them where they forgot something.

------
nodesocket
Very nice write up, and easy to read. The author does a great job explaining
the concepts. I do wish he had gone a bit deeper, though:

> [0-9][0-9]* (what this pattern matches is left as an exercise to the reader)

Some more real world examples to bring all the concepts home would have been
great.

~~~
noxer
The pattern is kinda nonsense. The first part, [0-9], matches one digit in the
range 0-9. The second part is obviously the same set, but the * quantifier
means it can occur zero or more times.

The whole pattern therefore means one digit 0-9 followed by zero or more
digits in the range 0-9.

So "3" would match, "21" would match, "123456" would match, and "" would not
match.

The obviously "correct" pattern for this behavior would be [0-9]+. The + means
1 or more and thus makes repeating the set unnecessary.
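A quick check of that equivalence in Python, as a minimal sketch (the sample
strings are the ones above plus one extra non-matching case):

```python
import re

LONG_FORM = re.compile(r"[0-9][0-9]*")
SHORT_FORM = re.compile(r"[0-9]+")

# Map each sample to (matches long form, matches short form).
results = {}
for text in ["3", "21", "123456", "", "12a"]:
    long_ok = LONG_FORM.fullmatch(text) is not None
    short_ok = SHORT_FORM.fullmatch(text) is not None
    results[text] = (long_ok, short_ok)
    print(repr(text), long_ok, short_ok)
```

Both patterns agree on every sample: the three digit strings match, while the
empty string and "12a" do not.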

~~~
Rainymood
Author here. I actually did not know this. I used this very old book (Unix
Power Tools) and after finishing the chapter on regex I didn't recall seeing a
`+` operator (maybe I forgot to add it or skipped over it). But indeed the `+`
operator seems to make significantly more sense here.

Thanks!

~~~
new_guy
Genuine question. Why would you write a guide for 'noobs', when you don't know
the basics yourself? It seems kind of silly.

~~~
Rainymood
I seem to have deleted part of the title. The article is titled

>Regex For Noobs (like me!) - an illustrated guide

So I am painfully aware that I'm a noob too, I even call myself one! I wrote
this little piece because I felt that regex can be kind of intimidating for a
lot of people but that it's actually pretty OK and fun once you get the hang
of it. To get people over that initial barrier is what I aim to do with the
"guide".

------
hartator
I love [https://rubular.com](https://rubular.com)

------
ktpsns
I learnt regex when I started with PHP in the 2000s and subsequently when I
learned Perl. When I started studying computer science at university, I
was super baffled that there is a full theory of regular expressions: that
of a regular (formal) language
([https://en.wikipedia.org/wiki/Regular_language](https://en.wikipedia.org/wiki/Regular_language)).

I recommend that everybody who's interested in "looking behind the curtains"
of regexps, their powers and limitations, dive into a crash course on formal
languages (Chomsky hierarchy). It can be useful for gaining a feeling for what
a regexp can do and what it cannot.

~~~
mlevental
the Chomsky hierarchy isn't the actual relevant hierarchy for formal languages
- it's the polynomial complexity hierarchy

[https://en.wikipedia.org/wiki/Polynomial_hierarchy](https://en.wikipedia.org/wiki/Polynomial_hierarchy)

------
radium3d
I can't memorize regex completely but my favorite cheat sheet for regular
expressions is this simple one:

[https://github.com/ashchan/cheatsheets/blob/master/misc/regu...](https://github.com/ashchan/cheatsheets/blob/master/misc/regular-expressions-cheat-sheet-v2.pdf)

For more advanced expressions I just do a quick Google search for existing
solutions before attempting to reinvent the wheel.

------
carstenhag
If someone wants to practice doing regexes a bit, you can try out
regexcrossword.com or my app (based on the site's idea, but I'm not affiliated
to them)
[https://play.google.com/store/apps/details?id=de.chagemann.r...](https://play.google.com/store/apps/details?id=de.chagemann.regexcrossword)

Note: this is a new account with my real name, I used to have an account here
already.

------
gilboise
I love the style of the pencil highlighted handwriting!

~~~
Rainymood
Author here. Thanks a lot for the kind words, they really do mean a lot to me
:)

------
SanchoPanda
Wonderful intro, among the best if not the single best basic introductions
I've seen.

~~~
Rainymood
Author here, thanks a lot. Your comment means a lot to me!

------
aniham
Throw in + and I think this covers close to 100% of things I've ever had to
grep. Very nice little tutorial/refresher.

------
daolf
Nice write-up, would love something like this for more advanced concepts such
as lookahead and condition.

------
mrcactu5
[https://regexper.com/](https://regexper.com/)

------
senthil_rajasek
Word of caution: Regexes are hard to maintain. They are harder to (re)read
than C-style languages.

~~~
ridiculous_fish
Regex can be as easy to maintain as code, if you treat them like code:

1\. Don't put it all on one line

2\. Indent nested constructs

3\. Write comments

See for example Python's verbose regex syntax:
[https://docs.python.org/3/library/re.html#re.VERBOSE](https://docs.python.org/3/library/re.html#re.VERBOSE)
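A minimal sketch of what that looks like with `re.VERBOSE`. The pattern (a
simple signed decimal number) and its group names are hypothetical, chosen
only to show the layout:

```python
import re

# Hypothetical pattern for a simple decimal number such as "-12.5".
NUMBER = re.compile(
    r"""
    (?P<sign> [+-] )?      # optional sign
    (?P<whole> [0-9]+ )    # one or more digits
    (?:                    # optional fractional part
        \.
        (?P<frac> [0-9]+ )
    )?
    """,
    re.VERBOSE,
)

match = NUMBER.fullmatch("-12.5")
print(match.groupdict())  # {'sign': '-', 'whole': '12', 'frac': '5'}
```

In verbose mode, whitespace outside character classes is ignored and `#`
starts a comment, so the pattern can be laid out and annotated like code.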

~~~
senthil_rajasek
The verbose mode helps a bit. I did a quick check and C# seems to have tricks
to accomplish "verbose" RE.

But still, if you have to comment RE to improve readability is that really an
improvement or one more thing to maintain?

My preference is to avoid RE in larger code bases.

------
stOneskull
If there's a 'suggestion box', I suggest making part 2 about greediness.

------
slimypi
Nice guide. I would have liked a little more detail. But this is great as it
is! Thanks

------
qwerty456127
"sufficiently advanced enough"? Did Arthur C. Clarke really say it this way?

------
blondin
i remember seeing some japanese anime-style programming guides but never
thought anything of them back then. i really miss these short illustrated
guides but i don't recall what company was publishing them.

~~~
bitofhope
Perhaps No Starch Press?
[https://nostarch.com/mg_databases.htm](https://nostarch.com/mg_databases.htm)

~~~
blondin
thank you so much. those were the books!

------
Waterluvian
I've never memorized any regex. It's so syntax dense without many intuitive
characters. Not sure this is a problem. They're basically write only. Please
always add a comment beside each regex saying what it is intended to do.

~~~
spraak
Also very helpful to have at least a few good unit tests to show what they
should be doing
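For instance, a table of inputs and expected outcomes can sit right next to
the pattern. A minimal sketch that could be dropped into any test framework
(the 24-hour time pattern, the cases, and the `failing_cases` helper are all
hypothetical):

```python
import re

# Hypothetical pattern under test: a 24-hour time such as "09:30".
TIME = re.compile(r"([01][0-9]|2[0-3]):[0-5][0-9]")

# Input text -> whether the pattern is expected to accept it.
CASES = {
    "00:00": True,
    "09:30": True,
    "23:59": True,
    "24:00": False,  # hour out of range
    "9:30": False,   # hour needs two digits
    "12:60": False,  # minute out of range
    "": False,
}

def failing_cases(pattern, cases):
    """Return the inputs whose actual outcome differs from the expected one."""
    return [
        text
        for text, expected in cases.items()
        if (pattern.fullmatch(text) is not None) != expected
    ]

print(failing_cases(TIME, CASES))  # []
```

The table doubles as documentation: a reader sees what the regex is meant to
accept and reject without having to decode it.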

------
j88439h84
The handwriting is not easy to read.

~~~
Rainymood
Author here, thanks for the feedback. I'll make sure the handwriting is
clearer next time. Thank you!

------
cusack
Best regex intro I’ve read

------
Endy
Well, this is a very good way of teaching people how to double their problems
effectively.

~~~
asicsp
Context: Some people, when confronted with a problem, think "I know, I'll use
regular expressions." Now they have two problems. — Jamie Zawinski

See also:
[https://blog.codinghorror.com/regular-expressions-now-you-ha...](https://blog.codinghorror.com/regular-expressions-now-you-have-two-problems/)

