Regex for Noobs – An Illustrated Guide (janmeppe.com)
541 points by Rainymood on Aug 4, 2019 | 105 comments



Regex is one of those topics I just can't seem to commit to memory. The concept is easy enough, and every time I need one I can usually work out what to do after a short amount of research, but I feel like I forget immediately afterwards. It may have to do with how regexes of any complexity look like machine code to me.


In addition to the 'practice' stated elsewhere, the other thing someone needs to really get to grips with something like regex is a 'reason'.

I never really got the point of regexes until I discovered capture groups. Pretty much everything I use them for is based on some problem like:

"Change date format from 'DD-MM-YY' to 'YYYY-MM-DD'"

"Keep only lines in the log file with a date in them"

"Swap the first name and last name of every contact in this list which has a comma in it"

"Rename all the files in this folder to strip out all the gubbins and put the useful stuff in a standard format"

And so on. You need a problem where doing it any other way than regex would be lengthy and painful and the regex itself is light and breezy. You build up little snippets or "phrases" that you can re-use easily to solve new problems.
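As a rough sketch of the first and third problems above, this is how capture groups might look in Python. The sample strings, the guess that two-digit years belong to the 2000s, and the "Last, First" input layout are all assumptions for illustration, not something the list above specifies.

```python
import re

# Assumption: two-digit years belong to the 2000s (e.g. '19' -> '2019').
def reformat_date(text):
    # Groups 1, 2 and 3 capture day, month and two-digit year.
    return re.sub(r"\b(\d{2})-(\d{2})-(\d{2})\b", r"20\3-\2-\1", text)

# Assumption: contacts with a comma are written as "Last, First".
def swap_names(line):
    # Lines without a comma do not match and are returned unchanged.
    return re.sub(r"^\s*([^,]+?)\s*,\s*(.+?)\s*$", r"\2 \1", line)

print(reformat_date("Due 04-08-19"))   # Due 2019-08-04
print(swap_names("Smith, John"))       # John Smith
print(swap_names("Jane Doe"))          # Jane Doe
```

The same two-step shape (capture the pieces, then reorder them in the replacement) covers most of the other problems in the list as well.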


Regexone (https://regexone.com/) is one of the best online tutorials I’ve come across for anything. I got a decent grasp of Regex after going through the series 1-2 times.


Reminds me of websites like https://www.sqlteaching.com/ It's obviously not for regexes but after that I understood basic SQL so much better.

(also I love comments that point out amazing learning resources with full conviction)


The creator of regexone.com also created sqlbolt.com for SQL, which I also highly recommend.


I haven't seen it mentioned here yet, but wanted to throw out (https://regexr.com/) as a decent resource for writing regular expressions. I think I haven't needed to use a tutorial because of how easy it makes writing them.


I'll second this, it's immensely useful when I can't wrap my head around a regular expression.


regex101.com is my go-to, because it supports regex flags and Python regexes.


Going to add regexper.com which is really good for explaining RegEx patterns.


Strongly agree with this recommendation. I ran through regexone.com once, and even though I already knew regexes, it helped me fill in some stuff I had gotten rusty on.

I highly recommend the lessons. If you do them, I also highly recommend donating the recommended $4 via Paypal (or more), although it's technically free to use. It's well worth it.


Practice, my friend. You might have developed a systematic methodology for quickly learning and internalising (many) other things. (Kudos.) But sometimes it just comes down to rote practice and repetition; time and effort might be key here for your mastery of this subject.


>time and effort might be key here for your mastery of this subject.

It doesn't have to be this way. If most programming languages were designed like regular expressions they would be unusable. I can spend a year w/o writing a single line of Pascal or C++ and then it would come back immediately. I occasionally found regular expressions useful in my work, but only a couple of times could justify spending time on learning a little bit of it only to forget the next day. In a few other cases I simply googled a solution and used w/o understanding it. Most often I just write code instead, it is verbose but at least it's clear what it does.


I suppose what I'm trying to say is that from personal and anecdotal experience I got 97% in a first year calculus course. I went in with a marginally good algebra result, but I resolved to learn the shit out of everything that came my way.

I systemised as much as I could, but there was a lot of rote practice when it came to learning trigonometric identities and transcendental functions. And not just the normal ones; I mean all of them ... both the inverse and hyperbolic identities.

I did all of the problems and got extra ones.

In short, I knew the material backwards and forwards; I finished hour long exams in 20 minutes, and was usually the first or second person to leave.

What I am trying to say is that regex is similar to calculus and in my view requires significant focus and practice to master. Or you can muddle your way through it when you have to.


To add to that, regex mojo has just kind of built up in my head over years. It's not a topic I was able to study, learn, and retain (well). Every time I stumble onto a regex use case I need to understand & don't, I level up at that time, and that's always been totally fine.

A few months ago I needed to add a regex for validating IPv4 addresses that were in a non-standard format. I had to do a little research, then I wrote & tested it, and promptly forgot 90% of what I had just learned.


The problem I have with learning regex is similar to the above. It's easy enough to (re)learn when you need it and it doesn't come up often enough for me to consider actually learning it as being useful.


This is a new and excellent way to actually remember how to write regex https://www.executeprogram.com/

Spaced repetition works!


I wrote a static code analysis tool based on regex, and only after writing hundreds of them, some of them thousands of characters long, did regex click.


That was how I was as well until I got tasked with making feature rich text inputs (decorate hashtags, user tags, replace names with emails, vice versa). At a certain point it gets too difficult to respond to user input as it is just too variable (user highlights 40 chars and presses ‘A’? user pastes 100 chars?). A bulletproof method is running find and replace when needed and the ideal way to do that is regex.
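A minimal sketch of that find-and-replace approach in Python, run over the whole input each time it changes. The link formats (/tags/... and /users/...) and the function name are made up for the example; a real app would use its own scheme.

```python
import re

# Hypothetical link formats; a real app would use its own URL scheme.
def decorate(text):
    # The (?<!\w) lookbehind skips '#' or '@' in the middle of a word,
    # so an email address like a@b.com is left alone.
    text = re.sub(r"(?<!\w)#(\w+)", r'<a href="/tags/\1">#\1</a>', text)
    text = re.sub(r"(?<!\w)@(\w+)", r'<a href="/users/\1">@\1</a>', text)
    return text

print(decorate("Thanks @ana for the #regex tips"))
```

Because the function always works from the full text, it does not matter whether the user typed one character or pasted a hundred.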

Forcing myself to get tough regex working is what really made me grasp some of the more advanced concepts that unlock regex. It’s mainly just that we never need regex, and if you use it for a simple task you’re a jerk for introducing unnecessary complexity. Why would people learn it?


It's because it's terse with quite a low amount of mnemonics (a meaningful feature is only 1-2 characters, there's no padding for a spatial understanding of hierarchies, etc., compared to your host language), and even though it comes in very handy, most developers aren't spending a large amount of time writing them in your average project.


I learned first about Regular Expressions, i.e. the finite automata that do a simpler subset of what Regex can do. Learning about this, drawing out the graphs, etc. may help commit them to memory.

Also start using them to help you find and replace stuff in a text editor. They are very handy for converting for example a csv into code.
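For instance, the CSV-to-code conversion could look roughly like this in Python. The two-column "name,age" layout and the PEOPLE variable name are invented for the sketch; in an editor the same pattern and replacement would go into the find and replace fields.

```python
import re

csv_text = """alice,30
bob,25"""

# Turn each "name,age" row into a Python tuple literal, one per line.
# re.MULTILINE makes ^ and $ match at every line boundary.
rows = re.sub(r"^(\w+),(\d+)$", r'    ("\1", \2),', csv_text, flags=re.MULTILINE)
code = "PEOPLE = [\n" + rows + "\n]"
print(code)
```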


> It may have to do with how regexes of any complexity look like machine code to me.

Regexes get a bad rap because programmers who write otherwise maintainable code throw code hygiene out the window when writing long regexes. Your host language is much more complicated than the regex language, but writing shitty unreadable regex is, for whatever reason, acceptable. Some unfortunate and unnecessary limitations of typical regex libraries make the problem worse.

1. Regexes are typically written as one-line strings instead of as structured code. When writing C/Java, programmers understand that they should put each element of a sequence on a separate line and use indentation to visually signal branching and loops. But for some reason when writing character-processing programs that also have sequences, branching (|), and loops, programmers almost always golf it and put the whole complex program all on one line.

If you're writing a regex longer than 5 or 10 atoms, place each sequence on a separate line and use newlines+indentation to visually offset disjunctive choices (|) and loops (star).

Rule of thumb: you should be able to sketch the rough shape of the finite automaton by crossing your eyes and eyeballing the shape of the regex, just like you should be able to sketch the control flow of a program by crossing your eyes and eyeballing the shape of the code.

2. Character group names are too terse (e.g., \s instead of \whitespace and \d instead of \digit), probably because of #1. Also, very few people give names for long subexpressions (e.g., factoring code out into functions with well-chosen names).

3. Gotos/try...catch (i.e., backtracking) are not used judiciously and aren't well-documented/tested. I often see backtracking or confusing mixes of lazy and greedy matching instead of just writing out a slightly longer disjunction.

4. Regexes are often used for languages that aren't even almost regular. A bit of backtracking is OK (gotos are sometimes OK), but if there's a lot of backtracking then you need to use a different class of languages/machines.

5. There's no way to embed non-regular matchers into a regex.

Due to the combination of 1-5, to match an email address with an optional recipient name (so something like "asdf@asdf.com" or something like "John Smith <asdf@asdf.com>") you need an insane regex that many people implement with lots of backtracking and so on, all stuffed into a single line.

But something like this would work just fine and is much more readable:

    \emailAddress := ...
    # todo: need to support dashes in names.
    \name := (
        [a-zA-Z]*
        \whitespace?
        [a-zA-Z]*
    )
    # matches asdf@asdf.com or John Smith <asdf@asdf.com>
    (
        \emailAddress
    )
    OR
    (
        \name
        \whitespace? 
        <
        \emailAddress
        >
    )
where e.g., the implementation of \emailAddress could be written as a stand-alone parser in the host language. But even without digging into email address you can already see how this is way more readable than:

    \emailAddress|[a-zA-Z]*\s?[a-zA-Z]*\s?<\emailAddress>
Writing readable regular expressions shouldn't be difficult -- just treat the regex like any other piece of code and allow inter-op with the host language. But few people/libraries put in the effort, and for whatever reason golfing regexes in production code is considered acceptable even in orgs where you'd be fired for code golfing in the host programming language.
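Something close to the layout above is already possible in Python by naming fragments in the host language and using re.VERBOSE for the line breaks. This is only a sketch: the EMAIL fragment is a deliberately simplified stand-in (an assumption, nowhere near a full address grammar), and the MAILBOX and parse names are made up here.

```python
import re

# Assumption: a deliberately simplified address pattern, NOT a full RFC matcher.
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
# Mirrors the \name fragment above: letters, optional whitespace, letters.
NAME = r"[a-zA-Z]*\s?[a-zA-Z]*"

MAILBOX = re.compile(
    rf"""
    (?:
        (?P<bare>{EMAIL})          # asdf@asdf.com
      |
        (?P<name>{NAME})           # John Smith
        \s?
        <
        (?P<addr>{EMAIL})          # the address inside angle brackets
        >
    )
    """,
    re.VERBOSE,
)

def parse(text):
    """Return (name or None, address), or None if the text does not match."""
    m = MAILBOX.fullmatch(text)
    if m is None:
        return None
    return (m.group("name") or None, m.group("bare") or m.group("addr"))

print(parse("asdf@asdf.com"))               # (None, 'asdf@asdf.com')
print(parse("John Smith <asdf@asdf.com>"))  # ('John Smith', 'asdf@asdf.com')
```

To get closer to the "stand-alone parser" idea, the captured address could be handed to a separate validation function after the regex has dealt with the overall shape.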


Lua's Lpeg module (http://www.inf.puc-rio.br/~roberto/lpeg/) is probably what you are after:

     lower = lpeg.R("az")
     upper = lpeg.R("AZ")
     letter = lower + upper
Personally I prefer though the terseness of regular expressions.


Python has another somewhat reasonable solution. Either of those solutions can be combined with good programming to constitute a reasonable solution.

> Personally I prefer though the terseness of regular expressions.

I think there are legitimate use-cases for both.

If you're quickly hacking out a small good-enough parser for something regular or "almost regular", terseness can be great.

However, if you're parsing a large regular language, terseness isn't really a benefit. Perl is on its deathbed for a reason, and overly terse regular expressions should die for a similar reason. Overly-clever write-only coding culture sucks.

But the terseness of regular expressions is basically terrible beyond maybe a few hundred characters. E.g., the following has no place in production code -- you might as well just include a binary:

    (?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
 )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:
 \r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(
 ?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[
 \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\0
 31]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\
 ](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+
 (?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:
 (?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
 |(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)
 ?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\
 r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[
 \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)
 ?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t]
 )*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[
 \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*
 )(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]
 )+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)
 *:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+
 |\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r
 \n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:
 \r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t
 ]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031
 ]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](
 ?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?
 :(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?
 :\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)|(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?
 :(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?
 [ \t]))*"(?:(?:\r\n)?[ \t])*)*:(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\]
 \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|
 \\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>
 @,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"
 (?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t]
 )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
 ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?
 :[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[
 \]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:[^()<>@,;:\\".\[\] \000-
 \031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(
 ?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)?[ \t])*(?:@(?:[^()<>@,;
 :\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([
 ^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\"
 .\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\
 ]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\
 [\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\
 r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\]
 \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]
 |\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?(?:[^()<>@,;:\\".\[\] \0
 00-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\
 .|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,
 ;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?
 :[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*))*@(?:(?:\r\n)?[ \t])*
 (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
 \[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t])*(?:[
 ^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]
 ]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(?:\r\n)?[ \t])*)(?:,\s*(
 ?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
 ".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(
 ?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[
 \["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t
 ])*))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t
 ])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?
 :\.(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|
 \Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*|(?:
 [^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\
 ]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)*\<(?:(?:\r\n)
 ?[ \t])*(?:@(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["
 ()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)
 ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>
 @,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*(?:,@(?:(?:\r\n)?[
 \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,
 ;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n)?[ \t]
 )*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\
 ".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*)*:(?:(?:\r\n)?[ \t])*)?
 (?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".
 \[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:
 \r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z|(?=[\[
 "()<>@,;:\\".\[\]]))|"(?:[^\"\r\\]|\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])
 *))*@(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])
 +|\Z|(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*)(?:\
 .(?:(?:\r\n)?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t])+|\Z
 |(?=[\["()<>@,;:\\".\[\]]))|\[([^\[\]\r\\]|\\.)*\](?:(?:\r\n)?[ \t])*))*\>(?:(
 ?:\r\n)?[ \t])*))*)?;\s*)


Thanks for your post, I learned a lot and I feel excited when I come across an experience to use a regex again to put this into practice.


One thing I still like about Perl is how well the regex syntax is integrated into the language. Combined with being installed on just about every modern Linux system, that makes it really accessible from your fingertips.

If you're in a hurry and just need to do some one-off slicing and dicing, there's not much more powerful than being able to say, eg:

  $data = "foobarfoo";
  $data =~ s/bar/baz/g;
  print "$data\n";
versus python's more dramatic and formal:

  import re
  data = "foobarfoo"
  data = re.sub('bar', 'baz', data)
  print(data)
For some reason, the Perl =~ notation sticks in my head much more easily than the Python method.

Of course you can string a bunch of sed commands together in a shell script as well, but I find that becomes unwieldy pretty quickly in a lot of situations.


Agreed.

In other languages a part of me always feels like if I'm using regex I'm doing something wrong, because I used it so liberally in Perl. It feels like a Perl-ism so I try to make a conscious effort not to carry those over as they tend to be the source of inefficiencies.


TBF for this specific case it's very much unnecessary to break out regexes: most languages do have some sort of literal replacement string API:

    str.replace = replace(...)
        S.replace(old, new[, count]) -> str
    
        Return a copy of S with all occurrences of substring
        old replaced by new.  If the optional argument count is
        given, only the first count occurrences are replaced.
(though the documentation turns out more misleading than re.sub's: this states it replaces all occurrences but really replaces all non-overlapping occurrences, which re.sub actually spells out properly).
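A quick way to see the non-overlapping behaviour, using throwaway strings chosen for the example:

```python
import re

# "aaa" contains "aa" at offsets 0 and 1, but those occurrences overlap,
# so only the first is replaced and the scan resumes after it.
print("aaa".replace("aa", "b"))     # ba
print(re.sub("aa", "b", "aaa"))     # ba

# The optional count argument limits how many replacements are made.
print("foobarfoo".replace("foo", "baz", 1))  # bazbarfoo
```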


Oddly enough, I was once asked to determine whether there were any differences between Perl's regex and Python's implementation, in an effort to convert a tool I wrote into Python (so other developers could help).

I found only one very weird edge case worth noting. In Perl, the substitution "s/ a* / x /g" (spaces added to prevent formatting) turns the string "bac" into "xbxxcx": because "a*" means zero or more, it also matches the empty strings between the characters. Python instead produces "xbxcx", because it matches the "a" once and does not count the empty string right after it. That is slightly less accurate, since * does mean zero or more and the gap between characters counts as zero characters.


That was changed recently, I get 'xbxcx' on 3.5 and 'xbxxcx' on 3.7
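This is easy to check on whatever interpreter you have at hand. The snippet below just runs the substitution and compares it with the value reported above for that version:

```python
import re
import sys

result = re.sub("a*", "x", "bac")
# Per the comments above: 'xbxxcx' on Python 3.7+, 'xbxcx' on earlier versions.
expected = "xbxxcx" if sys.version_info >= (3, 7) else "xbxcx"
print(sys.version_info[:2], result, result == expected)
```

From 3.7 on, an empty match directly after a non-empty one is replaced too, which brings the result in line with the Perl output quoted above.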

And coming to Perl vs Python regex differences, there are too many to count. Python's 3rd party module 'regex' is more comparable to Perl regex. For example, Python doesn't support possessive quantifiers, subexpression calls, \K, \G and so on.


Since Perl 5.14, it could be simplified to

  use v5.14;
  my $data = "foobarfoo";
  say $data =~ s/bar/baz/gr;
say is like print, but adds a newline as well.

The /r option does a non destructive replace and returns the result.


A question of taste, disposition and habituation, supposedly. I far prefer the Python way for clarity and immediate intelligibility, no magic spells required.


I much prefer the Perl way for writing short disposable stuff and much prefer the Python way if any other person (including me in more than 1-2 weeks) is going to use/modify/read the code.

The gap between perl and python seems cleaner than the gap between python and anything-statically-typed, at least.


https://regexone.com/ is a great resource for learning regex. Also try https://regex101.com/ as a sandbox for experimenting and expanding your regex skills.


Regex101 is basically my "regex editor", to the point that I have a keybinding in my normal code editor that opens it in the browser for me.

It's such a good resource for understanding and writing both simple and complicated regex.


I want to call out https://www.regular-expressions.info/ since it was, for a long time, one of the better/best resources on regexes that I was able to find. I learned a lot from this ... guy, essentially.


This website is great and well worth a donation if you've benefited from it in the past.

As a "next level" site, also give https://rexegg.com a look.


regex101 and similar sites like [1] are great

but I always try to add a warning when recommending them: they should only be used for the flavors they support (PCRE, Python, etc). I've seen many people use them for CLI tools and then wonder why things like non-greedy or lookarounds don't work.

[1] https://www.debuggex.com/


Why not use sublime as your regex editor? What makes regex101 special?


Can't speak for Sublime but 101 has inline highlighting and explanations per regex feature/flag/pattern. It really is awesome.


It's like the difference between writing C++ in Nano and writing C++ in Visual Studio.


The trouble with regular expressions is not their use, but the different "standards" (or lack thereof) and figuring out what the interpreter supports.



Yes, two issues are a) knowing what to escape and b) what is supported or not. I was weirdly surprised to discover that there are some useful functionalities that are basically supported nowhere... except in vim. I think it was variable-length lookbehinds but don't quote me on this.


Lookbehind is restricted in many regex engines that support it (many don't at all). Some require the lookbehind to be constant-length (so if you have alternatives in there, they all have to have the same length, and you can't use quantifiers, basically). Some require it to be finite-but-known-ahead-of-time-length, so something like (?<=a{3,6}) is okay, but (?<=a{3,}) is not. Also (?<=a|bb) would be okay.

.NET's System.Text.RegularExpressions.Regex is one implementation that has no restrictions on lookbehind. Having used PowerShell for so long it now happens sometimes that I forget when writing regexes for other implementations, as it's really convenient at times.
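To illustrate the constant-length camp, Python's built-in re module rejects anything whose width is not fixed at compile time. The compiles helper is just a name made up for this sketch:

```python
import re

def compiles(pattern):
    """Return True if Python's built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

print(compiles(r"(?<=abc)d"))      # True: fixed width of 3
print(compiles(r"(?<=ab|cd)e"))    # True: both alternatives have width 2
print(compiles(r"(?<=a|bb)c"))     # False: alternatives differ in width
print(compiles(r"(?<=a{3,6})b"))   # False: bounded, but not fixed, width
print(compiles(r"(?<=a{3})b"))     # True: exact repetition count
```

So the two examples that a "known-ahead-of-time" engine accepts are both errors in Python's re, which is worth knowing before porting a pattern from a more permissive engine.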


yeah, typically you would get error if you use variable length lookbehinds

a few do support it (for ex: Python's 3rd party regex module) and sometimes you could workaround with \K (similar to \zs in vim) [1]

And there are other frustrating differences between implementations, for example \g definition is very different between Ruby and Python, character set operations are not found everywhere, etc. Plus, BRE/ERE versions found in command line tools do not even support features like non-greedy and lookaround

[1] https://stackoverflow.com/questions/11640447/variable-length...


My knowledge of regex is basic. Last year, I ended up writing a piece of code which reads an inbound email, parses some of the data, and, depending on the type of the email and the data stored, creates a lead record on the system with the captured data. I wrote this in Apex, which is a proprietary language of Salesforce and is based on Java. Apex should adhere to the Java implementation of regex. Some things are still different. The website for regex calculations was showing one thing, the Java docs another, and Salesforce something else altogether... It took me a while.


Nice resource. You do start off making it a bit daunting:

>For most people without a formal CS education, regular expressions (regex) can come off as something that only the most hardcore unix programmers would dare to touch.

my experience wasn't as daunting as this at all.

this is my story: basically every regex I ever wrote always worked the first time and I found it super easy. my intro was the Perl 5 "camel" book, i.e. Programming Perl. never heard of regexes before that.

If you find this current tutorial we're discussing tricky or daunting, maybe give the resource I just mentioned ("Programming Perl" for Perl 5) a go because for whatever reason the explanation was super simple and writing a regex was one of the easiest things I ever did no matter what I wanted to do. I didn't know about the concept itself before I read that book. I've used it to parse loads and loads of things, use it in my editors, etc. It's always so easy.

I don't know if you can find the "Programming Perl" book, I tried to look and found something like that here: https://www.cs.ait.ac.th/~on/O/oreilly/perl/prog/ch01_07.htm

you can maybe ignore the stuff about Perl and still sort of follow the tutorial.

just an alternative in case someone had their curiosity piqued by this illustrated guide, but it doesn't quite "click" for them, and wanted to see the same thing written out in another teaching way. And a note that for me it was always not daunting at all!


I also learned regex from that book and had the same experience as you, so although I can't remember how that book worked its magic (it's been nearly 20 years) I can corroborate your recommendation.

I also found the various perldoc pages on regex to be very helpful resource that I referred to frequently back when I was writing perl. Notable pages are perlretut [0], which serves as an intro to regex, perlre [1] which goes into considerable depth, and perlreref [2] which is a handy quick reference for day-to-day work.

[0] https://perldoc.perl.org/5.30.0/perlretut.html

[1] https://perldoc.perl.org/5.30.0/perlre.html

[2] https://perldoc.perl.org/5.30.0/perlreref.html


By any chance studied in Belgium, Ghent? Just curious for educational purpose :)


>By any chance studied in Belgium, Ghent?

No, I didn't. (at the time, bought the Perl book in the U.S., where I read it.)


Regular expressions may appear intimidating to the complete newbie, but once you grasp a few simple ideas I'd argue it is one of the easier topics in CS.


And incredibly useful. I don't know if that's just me and the things I do with a computer, but I feel like it's a knowledge I've used a million times in my life, both for my pet projects and my job as a law school TA. What baffles me is that it's a niche knowledge. Excel is insanely useful, but it's considered a basic knowledge, or at least everyone has heard of it. Not regex.


Agreed. I teach a session on it every year to Economics PhDs and faculty — if I can only get them to come. It's something they just don't know they don't know.

http://mackerron.com/text/text-slides.pdf#page19


Wow, I just wanted to say thanks for sharing this. These slides are great. Would you mind if I shared this link with some coworkers?


Share away!


A small collection of gotchas to regexes: https://www.rexegg.com/regex-gotchas.html


Shameless plug: I've written books [1] which focus entirely on the specific flavor supported by the tool or the programming language. I use plenty of examples to present the features one by one. These can be used for learning purposes as well as a handy reference guide.

I think of regular expressions as a mini-programming language, and use loose fitting analogies

    Anchors --> adding if conditions
    Alternation --> Conditional OR
    a(b|c)d = abd|acd --> a(b+c)d = abd+acd
    Quantifiers --> string repetition and range operators
    Dot metacharacter + quantifiers + alternation --> Conditional AND
    Capture group and backreference --> variable
    Subexpression call --> function
    Character class --> sets
    Lookarounds --> custom conditionals
    Flags --> like command line options
So far, I've written books for Ruby and Python regular expressions, and BRE/ERE/PCRE/Rust for "GNU grep and ripgrep". Currently writing a book for GNU sed.

[1] https://learnbyexample.github.io/books/


I consider myself to be fairly well versed in regex having used it for years, but a recent task to match only stored procedure definitions including the optional block comments immediately preceding them (if any) in large SQL scripts proved completely exasperating. I was deep in the weeds trying to get it to do optional positive look-behind assertions before I just gave up and wrote some code to do it.

The gory details of how to just match block comments are best outlined by this blog post detailing one kindred spirit's journey through the wilderness: https://blog.ostermiller.org/find-comment, although this wasn't particularly helpful to my situation, it was reassuring to know that I was not alone and might prove interesting to the crowd here.


Regex named capture groups are somewhat of an underused feature I find, especially with things like simple parsers they help to keep stuff readable https://regular-expressions.mobi/named.html
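A small illustration in Python, where the syntax is (?P<name>...). The "date level message" log format is hypothetical, picked only to have something to parse:

```python
import re

# Hypothetical log format: "YYYY-MM-DD LEVEL message".
LINE = re.compile(r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.*)")

m = LINE.fullmatch("2019-08-04 ERROR disk full")
print(m.group("level"))   # ERROR
print(m.groupdict())      # all three fields, keyed by group name
```

Reading m.group("level") a month later is a lot easier than remembering what group 2 was.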


The same is true for free-spacing mode and comments, which can improve the readability of regular expressions a lot: https://www.regular-expressions.info/freespacing.html
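In Python, free-spacing mode is the re.VERBOSE flag. A tiny sketch with a made-up decimal-number pattern:

```python
import re

NUMBER = re.compile(
    r"""
    [+-]?          # optional sign
    \d+            # integer part
    (?: \. \d+ )?  # optional fractional part
    """,
    re.VERBOSE,
)

print(bool(NUMBER.fullmatch("-12.5")))  # True
print(bool(NUMBER.fullmatch("12.")))    # False
```

Whitespace in the pattern is ignored and # starts a comment, so a literal space or # has to be escaped or put in a character class.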


Has anybody seen an introduction to regexes that starts with finite automata?

I was introduced to them that way and it seemed very intuitive to me. I’m not sure if I just have a head for regular expressions, or if the way they were presented was especially good.


NC State's second semester CS course uses a regex model as an example of Finite State Machines; however, I think it's only an example, not a deep dive into it.


My university (UC Riverside) does this. I haven't taken that course specifically as it's not required for my major, so I can't speak to any specifics.


I've taught a few "intro to regex" classes and have settled on a very simple explanation that seems to work even with fairly non-technical audiences. In short:

1) Regular expressions and "regex" are different things. Regex is a superset of regular expressions. But if you can understand regular expressions you can understand a huge number of regexes and have the tools to understand the rest of the syntax that the system uses.

2) There are many many many regular expression and regex dialects and engines. It's useful to learn the dialect you are trying to use.

3) Originally, regular expressions were used to describe a generator of strings. It can be helpful to approach writing regexes from this standpoint. It's not about what it matches, but what strings the regex can produce. I've had many students go from bewildered incomprehension to immediate eureka when I describe this.

4) There are only three rules that the beginning regex user has to know:

i) Concatenation - you can make a bigger generator by combining one generator after the other. For example: the regex 'a' and the regex 'b' can be concatenated into 'ab' and produces only the string 'ab'.

ii) Alternation - you can specify one character or another using the pipe character in most dialects '|'. For example the regex 'a|b' produces either the string 'a' or 'b'.

iii) Repetition - the character '*' is a postfix operator that means the character that precedes it can be repeated zero or more times, with no upper limit. For example: 'a*' produces the empty string, 'a', 'aa', 'aaa' and so on.

5) By combining these (i.e. using concatenation) and the grouping operators '(' ')' you can write a regex for almost anything.
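One way to try out the "generator" view in Python is to list every short string a pattern accepts in full. This is only a sketch: the helper name `language`, the two-letter alphabet and the length cap are assumptions made for the example, and it filters candidate strings rather than truly generating them.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=3):
    """List every string over `alphabet`, up to `max_len` long, that the pattern fully matches."""
    compiled = re.compile(pattern)
    candidates = (
        "".join(chars)
        for length in range(max_len + 1)
        for chars in product(alphabet, repeat=length)
    )
    return [word for word in candidates if compiled.fullmatch(word)]


concatenation = language("ab")           # rule i
alternation = language("a|b")            # rule ii
repetition = language("a*")              # rule iii
grouped = language("(a|b)*", max_len=2)  # rules combined with grouping
```

Here `concatenation` is `['ab']`, `alternation` is `['a', 'b']`, `repetition` is `['', 'a', 'aa', 'aaa']`, and `grouped` contains every string of length 0 to 2 over the two letters.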

6) Nearly all of the other characters you see in regexes that don't make sense are simply syntactic sugar that makes it shorter and easier to write common expressions. Examples:

Writing lots of alternations of single characters (called character classes):

- 'a|b|c|d|e|f' can be written as '[a-f]'

- an expression that is the alternation of every character except the newline character can be written as '.'

There's a whole host of these kinds of things.

Writing lots of different kinds of repetitions introduces lots of new operators:

- 'aa*' can be written as 'a+', where '+' means 'repeat the previous character 1 or more times'

- 'a?' means 'repeat the previous character 0 or 1 times'

- 'a{1,5}' means 'repeat the previous character 1 to 5 times'

There's also an entire table of these things.
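The "sugar" claim can be spot-checked in Python by comparing the long and short spellings on a handful of strings. This is a sketch, not a proof: the helper `agree` and the sample list are assumptions for the example, and agreement on a few samples only suggests equivalence.

```python
import re


def agree(pattern_a, pattern_b, samples):
    """True if both patterns accept or reject every sample as a whole string."""
    return all(
        bool(re.fullmatch(pattern_a, s)) == bool(re.fullmatch(pattern_b, s))
        for s in samples
    )


SAMPLES = ["", "a", "aa", "aaaaa", "aaaaaa", "c", "f", "g", "ab", "\n"]

class_ok = agree(r"a|b|c|d|e|f", r"[a-f]", SAMPLES)
plus_ok = agree(r"aa*", r"a+", SAMPLES)
optional_ok = agree(r"a?", r"a{0,1}", SAMPLES)

# '.' accepts any single character except a newline (Python's default behaviour).
dot_accepts = [s for s in ["a", "g", "\n"] if re.fullmatch(r".", s)]

# 'a{1,5}' accepts runs of one to five 'a' characters and nothing longer.
bounded_lengths = [
    len(s) for s in ["", "a", "aaaaa", "aaaaaa"] if re.fullmatch(r"a{1,5}", s)
]
```

All three `*_ok` flags come out true on these samples, `dot_accepts` is `['a', 'g']`, and `bounded_lengths` is `[1, 5]`.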

After that it's mostly practice and referring to the description of the operators for the dialect you are using.

I usually cover capture groups (e.g. \1, \2, \3 or $1, $2, $3) later on if necessary, as it requires people to have a good handle on the particulars of their language's way of handling those things. At that point I'll also cover the '(?:.)' non-capturing group operator if their dialect allows it.
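For a concrete feel of what that later lesson looks like, here is a small Python sketch. Python's `re` module happens to use the `\1` style in replacement strings; the names and titles are made-up sample data.

```python
import re

# Numbered groups: \1 and \2 in the replacement refer to the first and
# second parenthesised group of the pattern.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Lovelace, Ada")

# (?:...) groups the alternation without capturing it, so the only
# captured group is the name that follows the title.
captured = re.match(r"(?:Mr|Ms) (\w+)", "Ms Hopper").groups()
```

Here `swapped` is `'Ada Lovelace'` and `captured` is the one-element tuple `('Hopper',)`.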

About as frequently I'll have to introduce '^' and '$' as the beginning- and end-of-string operators, and also distinguish that '^' from the one in negated character classes such as '[^.]'.
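The two meanings of '^' can be shown side by side in Python. This is a minimal sketch with made-up sample strings.

```python
import re

# Outside brackets, '^' and '$' anchor the match to the start and end of the string.
anchored = bool(re.search(r"^cat$", "cat"))          # the whole string is "cat"
unanchored = bool(re.search(r"cat", "concatenate"))  # "cat" found in the middle
rejected = bool(re.search(r"^cat$", "concatenate"))  # the anchors rule this out

# Inside brackets, a leading '^' negates the class: '[^.]' is any one character
# that is not a literal dot (a '.' inside brackets is not a wildcard).
first_label = re.match(r"[^.]+", "www.example.com").group()
```

`anchored` and `unanchored` are true, `rejected` is false, and `first_label` is `'www'`.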

All the crazy-ass back/forward reference stuff I usually leave out as people almost never need or come across those except in pretty rare instances. In those cases they usually have enough regex under their belt that it's just another topic to pick up.

This doesn't get anybody to pro-wizard level and there's always edge cases and weird Perl-golf stuff that's out there, but it'll get most people to generally functional inside of a day or two and I'll just make myself available to spot answer questions or remind them where they forgot something.


Very nice write-up, and easy to read. The author does a great job explaining the concepts. I do wish he had dug a bit deeper, though:

> [0-9][0-9]* (what this pattern matches is left as an exercise to the reader)

Some more real world examples to bring all the concepts home would have been great.


The pattern is kinda nonsense. The first part, [0-9], matches one digit in the range 0-9. The second part is the same set, but the * quantifier means it can occur zero or more times.

The whole pattern therefore means one digit 0-9 followed by zero or more digits in the range 0-9.

So "3" would match, "21" would match, "123456" would match, and "" would not match.

The obviously "correct" pattern for this behavior would be [0-9]+. The + means 1 or more and thus makes repeating the set unnecessary.
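The four strings above can be run through both spellings in Python. This sketch uses `fullmatch` so that the whole string has to fit the pattern; the variable names are made up for the example.

```python
import re

LONG_FORM = re.compile(r"[0-9][0-9]*")
SHORT_FORM = re.compile(r"[0-9]+")

samples = ["3", "21", "123456", ""]

# For each sample, record whether the long form and the short form accept it.
results = {
    s: (bool(LONG_FORM.fullmatch(s)), bool(SHORT_FORM.fullmatch(s)))
    for s in samples
}
```

Both patterns accept "3", "21" and "123456", and both reject the empty string.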


Author here. I actually did not know this. I used this very old book (Unix Power Tools) and after finishing the chapter on regex I didn't recall seeing a `+` operator (maybe I forgot to add it or skipped over it). But indeed the `+` operator seems to make significantly more sense here.

Thanks!


No problem =) I was just a bit confused myself, as I saw that pattern and it didn't instantly make sense why someone would write that. If you use regex as often as I do, you kinda instantly see that something is weird if a pattern has a repeating sub-pattern. It basically should never happen if the pattern is simplified.

You may wanna have a look at the PCRE flavor. If you understand it, you can work with all the others (you'll only miss some features).

The + is really just a shortcut for {1,}, which means 1 or more, greedy; the * is a shortcut for {0,}, which means 0 or more, greedy. Also, while I'm at it, the "even more correct" pattern would ofc be \d+ (at least in PCRE): \d stands for digit and is a shortcut for [0-9]. Personally I would never write [0-9]; I only ever use sets for partial ranges, e.g. [5-9] or [a-f]. Whenever you need the "full" set there is usually a shorter way to write it, which makes patterns shorter and easier to read. Another nice trick is to uppercase a shortcut to invert its meaning, so \D matches any non-digit.


'+' isn't a part of BRE syntax. It stems from ERE and all of its offshoots.

https://www.gnu.org/software/sed/manual/html_node/BRE-syntax...


The "+" only saves you a few keystrokes. Using "*" is also fine and not wrong, but a little more error prone, so I would note to better use "+" in a peer review.


Exactly, it's mostly about readability, which is often (not always) better if the shortest possible pattern is used.

(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9){0,} and [0-9][0-9]* and [0-9]+ and \d+ all do the same, so they are not wrong, in the sense that the result will be correct.
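A quick Python check of these spellings, with two caveats worth knowing. Inside square brackets a '|' is a literal character rather than alternation, so a class written with pipes also accepts '|'. And in Python 3 specifically, \d on text strings also matches non-ASCII digits unless the re.ASCII flag is set, so there it is not exactly [0-9]. The constant and helper names are assumptions for this sketch.

```python
import re

ALTERNATION = r"(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9){0,}"
BRACKETED_WITH_PIPES = r"[0|1|2|3|4|5|6|7|8|9]+"


def accepts(pattern, text, flags=0):
    """True if the pattern matches the whole of `text`."""
    return re.fullmatch(pattern, text, flags) is not None


# All four spellings accept an ordinary ASCII digit string.
digits_agree = all(
    accepts(p, "2019") for p in (ALTERNATION, r"[0-9][0-9]*", r"[0-9]+", r"\d+")
)

# '|' is literal inside [...], but separates alternatives inside (...).
pipe_in_class = accepts(BRACKETED_WITH_PIPES, "1|2")
pipe_in_group = accepts(ALTERNATION, "1|2")

# "\u0663" is ARABIC-INDIC DIGIT THREE, a digit outside the ASCII range.
unicode_digit = accepts(r"\d+", "\u0663")
ascii_only = accepts(r"\d+", "\u0663", re.ASCII)
```

`digits_agree`, `pipe_in_class` and `unicode_digit` are true; `pipe_in_group` and `ascii_only` are false.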


Genuine question. Why would you write a guide for 'noobs', when you don't know the basics yourself? It seems kind of silly.


I seem to have deleted part of the title. The article is titled

>Regex For Noobs (like me!) - an illustrated guide

So I am painfully aware that I'm a noob too, I even call myself one! I wrote this little piece because I felt that regex can be kind of intimidating for a lot of people but that it's actually pretty OK and fun once you get the hang of it. To get people over that initial barrier is what I aim to do with the "guide".


People who are new to a topic and just learned a new thing often do a better job at explaining things without requiring prior knowledge. Experts in a topic often see things as trivial and write introductions that exclude a fraction of potential readers that lack a skill.

I can recommend the articles by Julia Evans[1], which are pretty much like this. She works on a problem that is new to her and describes how she solved it or how she learned something new.

[1]: https://jvns.ca/


It's a great way to reinforce your own learning, share what you just learned, and get great feedback (many of the comments in this thread)


Because noobs wouldn't know to find problems ;)


I'd also like to see a part 2 and perhaps a part 3. This is the clearest explanation of regex that I've seen - and I've seen quite a few!



I learnt regex when I started with PHP in the 2000s and subsequently when I learned Perl. When I started studying computer science at university, I was super baffled that there is a full theory of regular expressions: that of a regular (formal) language (https://en.wikipedia.org/wiki/Regular_language).

I recommend everybody who's interested in "looking behind the curtains" of regexp, their powers and limitations, to dive into a crash course on formal languages (Chomsky hierarchy). It can be useful for gaining a feeling for what a regexp can do and what it cannot.


the Chomsky hierarchy isn't the actual relevant hierarchy for formal languages - it's the polynomial complexity hierarchy

https://en.wikipedia.org/wiki/Polynomial_hierarchy


I can't memorize regex completely but my favorite cheat sheet for regular expressions is this simple one:

https://github.com/ashchan/cheatsheets/blob/master/misc/regu...

For more advanced expressions I just do a quick Google search for existing solutions before attempting to reinvent the wheel.


If someone wants to practice doing regexes a bit, you can try out regexcrossword.com or my app (based on the site's idea, but I'm not affiliated to them) https://play.google.com/store/apps/details?id=de.chagemann.r...

Note: this is a new account with my real name, I used to have an account here already.


I love the style of the pencil highlighted handwriting!


Author here. Thanks a lot for the kind words, they really do mean a lot to me :)


Wonderful intro, among the best if not the single best basic introductions I've seen.


Author here, thanks a lot. Your comment means a lot to me!


Throw in + and I think this covers close to 100% of things I've ever had to grep. Very nice little tutorial/refresher.


Nice write-up, would love something like this for more advanced concepts such as lookahead and condition.



Word of caution: Regexes are hard to maintain. They are harder to (re)read than C-style languages.


Regex can be as easy to maintain as code, if you treat them like code:

1. Don't put it all on one line

2. Indent nested constructs

3. Write comments

See for example Python's verbose regex syntax: https://docs.python.org/3/library/re.html#re.VERBOSE
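A minimal sketch of the three rules using re.VERBOSE, where whitespace in the pattern is ignored and '#' starts a comment. The DD-MM-YY rewriter, the group names and the `century` parameter are assumptions made for the example.

```python
import re

DATE = re.compile(
    r"""
    \b
    (?P<day>   \d{2} )  -   # DD
    (?P<month> \d{2} )  -   # MM
    (?P<year>  \d{2} )      # YY
    \b
    """,
    re.VERBOSE,
)


def to_iso(text, century="20"):
    """Rewrite DD-MM-YY dates as YYYY-MM-DD, assuming every year is in `century`."""
    return DATE.sub(
        lambda m: f"{century}{m.group('year')}-{m.group('month')}-{m.group('day')}",
        text,
    )


converted = to_iso("Released 04-08-19.")
untouched = to_iso("Released 2019-08-04.")
```

`converted` is `'Released 2019-08-04.'`, while the already-converted string comes back unchanged because the word-boundary anchors stop the pattern from matching inside the four-digit year.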


The verbose mode helps a bit. I did a quick check and C# seems to have tricks to accomplish "verbose" RE.

But still, if you have to comment an RE to improve readability, is that really an improvement, or just one more thing to maintain?

My preference is to avoid RE in larger code bases.


If there's a 'suggestion box', I suggest to make part 2 about greediness


Nice guide. I would have liked a little more detail. But this is great as it is! Thanks


"sufficiently advanced enough"? Did Arthur C. Clarke really say it this way?


i remember seeing some japanese anime-style programming guides but never thought anything of them back then. i really miss these short illustrated guides but i don't recall what company was publishing them.


Perhaps No Starch Press? https://nostarch.com/mg_databases.htm


thank you so much. those were the books!


I've never memorized any regex. It's so syntax dense without many intuitive characters. Not sure this is a problem. They're basically write only. Please always comment beside regexes what they intend on doing.


Also very helpful to have at least a few good unit tests to show what they should be doing
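Something along these lines, sketched with Python's unittest. The pattern under test (`SIGNED_INT`) and the sample strings are hypothetical; the point is that the accepted and rejected examples document the intent next to the regex.

```python
import re
import unittest

# Hypothetical pattern under test: an optionally signed integer such as "-42" or "+7".
SIGNED_INT = re.compile(r"[+-]?[0-9]+")


class SignedIntTest(unittest.TestCase):
    def test_accepts_integers(self):
        for text in ["42", "-42", "+7", "007"]:
            with self.subTest(text=text):
                self.assertIsNotNone(SIGNED_INT.fullmatch(text))

    def test_rejects_everything_else(self):
        for text in ["", "4.2", "--1", "12a", " 5"]:
            with self.subTest(text=text):
                self.assertIsNone(SIGNED_INT.fullmatch(text))
```

Saved in a file, the tests can be run with `python -m unittest`; each `subTest` reports the exact string that failed.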


The handwriting is not easy to read.


Author here, thanks for the feedback. I'll make sure the handwriting is clearer next time. Thank you!


Best regex intro I’ve read


Well, this is a very good way of teaching people how to double their problems effectively.


Context: Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems. — Jamie Zawinski

See also: https://blog.codinghorror.com/regular-expressions-now-you-ha...



