Regex is one of those topics I just can't seem to commit to memory. The concept is easy enough, and every time I need one I can usually work out what to do after a short amount of research, but I feel like I forget immediately afterwards. It may have to do with how regexes of any complexity look like machine code to me.
In addition to the 'practice' stated elsewhere, the other thing someone needs to really get to grips with something like regex is a 'reason'.
I never really got the point of regexes until I discovered capture groups. Pretty much everything I use them for is based on some problem like:
"Change date format from 'DD-MM-YY' to 'YYYY-MM-DD'"
"Keep only lines in the log file with a date in them"
"Swap the first name and last name of every contact in this list which has a comma in it"
"Rename all the files in this folder to strip out all the gubbins and put the useful stuff in a standard format"
And so on. You need a problem where doing it any other way than regex would be lengthy and painful and the regex itself is light and breezy. You build up little snippets or "phrases" that you can re-use easily to solve new problems.
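As a rough illustration of two of the problems above, here is a minimal Python sketch using capture groups. The function names are made up for this example, and it assumes every two-digit year belongs to the 2000s, which the original problem statement does not say.

```python
import re

def to_iso(text):
    """Rewrite every DD-MM-YY date in text as YYYY-MM-DD.

    Assumption: two-digit years are in the 2000s.
    """
    # Groups 1, 2, 3 capture day, month, year; the replacement reorders them.
    return re.sub(r"\b(\d{2})-(\d{2})-(\d{2})\b", r"20\3-\2-\1", text)

def swap_names(line):
    """Turn 'Last, First' into 'First Last'; lines without a comma are unchanged."""
    return re.sub(r"^\s*([^,]+?)\s*,\s*(.+?)\s*$", r"\2 \1", line)

print(to_iso("due 03-11-19"))
print(swap_names("Smith, John"))
```

Both functions are the same "phrase" reused: capture the pieces, then reference them by number in the replacement.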
Regexone (https://regexone.com/) is one of the best online tutorials I’ve come across for anything. I got a decent grasp of Regex after going through the series 1-2 times.
I haven't seen it mentioned here yet, but wanted to throw out (https://regexr.com/) as a decent resource for writing regular expressions. I think I haven't needed to use a tutorial because of how easy it makes writing them.
Strongly agree with this recommendation. I ran through regexone.com once, and even though I already knew regexes, it helped me fill in some stuff I had gotten rusty on.
I highly recommend the lessons. If you do them, I also highly recommend donating the recommended $4 via Paypal (or more), although it's technically free to use. It's well worth it.
Practice, my friend. You might have developed a systematic methodology for quickly learning and internalising (many) other things. (Kudos.) But sometimes it just comes down to rote practice and repetition; time and effort might be key here for your mastery of this subject.
>time and effort might be key here for your mastery of this subject.
It doesn't have to be this way. If most programming languages were designed like regular expressions they would be unusable. I can spend a year w/o writing a single line of Pascal or C++ and then it would come back immediately. I occasionally found regular expressions useful in my work, but only a couple of times could justify spending time on learning a little bit of it only to forget the next day. In a few other cases I simply googled a solution and used w/o understanding it. Most often I just write code instead, it is verbose but at least it's clear what it does.
I suppose what I'm trying to say is that from personal and anecdotal experience I got 97% in a first year calculus course. I went in with a marginally good algebra result, but I resolved to learn the shit out of everything that came my way.
I systemised as much as I could, but there was a lot of rote practice when it came to learning trigonometric identities and transcendental functions. And not just the normal ones; I mean all of them ... both the inverse and hyperbolic identities.
I did all of the problems and got extra ones.
In short, I knew the material backwards and forwards; I finished hour long exams in 20 minutes, and was usually the first or second person to leave.
What I am trying to say is that regex is similar to calculus and in my view requires significant focus and practice to master. Or you can muddle your way through it when you have to.
To add to that, regex mojo has just kind of built up in my head over years. It's not a topic I was able to study, learn, and retain (well). Every time I stumble onto a regex use case I need to understand & don't, I level up at that time, and that's always been totally fine.
A few months ago I needed to add a regex for validating IPv4 addresses that were in a non-standard format. I had to do a little research, then I wrote & tested it, and promptly forgot 90% of what I had just learned.
The problem I have with learning regex is similar to the above. It's easy enough to (re)learn when you need it and it doesn't come up often enough for me to consider actually learning it as being useful.
That was how I was as well until I got tasked with making feature rich text inputs (decorate hashtags, user tags, replace names with emails, vice versa). At a certain point it gets too difficult to respond to user input as it is just too variable (user highlights 40 chars and presses ‘A’? user pastes 100 chars?). A bulletproof method is running find and replace when needed and the ideal way to do that is regex.
Forcing myself to get tough regex working is what really made me grasp some of the more advanced concepts that unlock regex. It’s mainly just that we never need regex, and if you use it for a simple task you’re a jerk for introducing unnecessary complexity. Why would people learn it?
It's because it's terse with quite a low amount of mnemonics (a meaningful feature is only 1-2 characters, there's no padding for a spatial understanding of hierarchies, etc., compared to your host language), and even though it comes in very handy, most developers aren't spending a large amount of time writing them in your average project.
I learned first about Regular Expressions i.e. the finite automata that do a simpler subset of what Regex can do. Learning about this and drawing out the graphs etc, may help commit to memory.
Also start using them to help you find and replace stuff in a text editor. They are very handy for converting for example a csv into code.
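The CSV-to-code trick works the same way outside an editor. A small sketch in Python, where the two-column "name,number" layout and the PEOPLE variable are invented for illustration:

```python
import re

csv_text = "alice,30\nbob,25"

# One find-and-replace pass: each 'name,number' line becomes a tuple literal.
rows = re.sub(r"^(\w+),(\d+)$", r'    ("\1", \2),', csv_text, flags=re.MULTILINE)
code = "PEOPLE = [\n" + rows + "\n]"
print(code)
```

In an editor the same pattern and replacement go into the find/replace dialog, usually with $1 and $2 instead of \1 and \2 depending on the editor.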
> It may have to do with how regexes of any complexity look like machine code to me.
Regexes get a bad rap because programmers who write otherwise maintainable code throw code hygiene out the window when writing long regexes. Your host language is much more complicated than the regex language, but writing shitty unreadable regex is, for whatever reason, acceptable. Some unfortunate and unnecessary limitations of typical regex libraries make the problem worse.
1. Regexes are typically written as one-line strings instead of as structured code. When writing C/Java, programmers understand that they should put each element of a sequence on a separate line and use indentation to visually signal branching and loops. But for some reason when writing character-processing programs that also have sequences, branching (|), and loops, programmers almost always golf it and put the whole complex program all on one line.
If you're writing a regex longer than 5 or 10 atoms, place each sequence on a separate line and use newlines+indentation to visually offset disjunctive choices (|) and loops (star).
Rule of thumb: you should be able to sketch the rough shape of the finite automaton by crossing your eyes and eyeballing the shape of the regex, just like you should be able to sketch the control flow of a program by crossing your eyes and eyeballing the shape of the code.
2. Character group names are too terse (e.g., \s instead of "\whitespace" and \d instead of \digit), probably because of #1. Also, very few people give names for long subexpressions (e.g., factoring code out into functions with well-chosen names).
3. Gotos/try...catch (i.e., backtracking) are not used judiciously and aren't well-documented/tested. I often see backtracking or confusing mixes of lazy and greedy matching instead of just writing out a slightly longer disjunction.
4. Regexes are often used for languages that aren't even almost regular. A bit of backtracking is OK (gotos are sometimes OK), but if there's a lot of backtracking then you need to use a different class of languages/machines.
5. There's no way to embed non-regular matchers into a regex.
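To make point 3 concrete, here is a small Python sketch comparing a greedy match, a lazy match, and an explicit negated character class on a made-up input. The third form is one way of "just writing out" what should be matched instead of leaning on lazy quantifiers; it is an illustration, not the only alternative.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the end of the string and backs up to the last '>'.
greedy = re.findall(r"<.*>", text)

# Lazy: .*? grows one character at a time until a '>' follows.
lazy = re.findall(r"<.*?>", text)

# Explicit: say exactly which characters may appear inside a tag.
explicit = re.findall(r"<[^>]*>", text)

print(greedy)
print(lazy)
print(explicit)
```

The lazy and explicit versions find the same four tags here, but the explicit one states the intent directly in the pattern.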
Due to the combination of 1-5, to match an email address with an optional recipient name (so something like "asdf@asdf.com" or something like "John Smith <asdf@asdf.com>") you need an insane regex that many people implement with lots of backtracking and so on, all stuffed into a single line.
But something like this would work just fine and is much more readable:
\emailAddress := ...
# todo: need to support dashes in names.
\name := (
    [a-zA-Z]*
    \whitespace?
    [a-zA-Z]*
)
# matches asdf@asdf.com or John Smith <asdf@asdf.com>
(
    \emailAddress
)
OR
(
    \name
    \whitespace?
    <
    \emailAddress
    >
)
where e.g., the implementation of \emailAddress could be written as a stand-alone parser in the host language. But even without digging into email address you can already see how this is way more readable than:
Writing readable regular expressions shouldn't be difficult -- just treat the regex like any other piece of code and allow inter-op with the host language. But few people/libraries put in the effort, and for whatever reason golfing regexes in production code is considered acceptable even in orgs where you'd be fired for code golfing in the host programming language.
Python has another somewhat reasonable solution. Either of those solutions can be combined with good programming to constitute a reasonable solution.
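One Python feature that fits this description is the re.VERBOSE flag, which allows whitespace and comments inside a pattern (this is an assumption; the comment does not say which feature it has in mind). A minimal sketch combining it with named sub-patterns follows. EMAIL and NAME are deliberately simplified stand-ins invented for this example, nowhere near a full email grammar.

```python
import re

# Simplified building blocks, named like functions (hypothetical, not RFC-compliant).
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
NAME = r"[A-Za-z]+(?:\s[A-Za-z]+)*"

RECIPIENT = re.compile(
    rf"""
    ^
    (?:
        (?P<name>{NAME})       # display name
        \s*
        <(?P<addr>{EMAIL})>    # address in angle brackets
      |
        (?P<bare>{EMAIL})      # or just a bare address
    )
    $
    """,
    re.VERBOSE,
)

with_name = RECIPIENT.match("John Smith <asdf@asdf.com>")
bare = RECIPIENT.match("asdf@asdf.com")
print(with_name.group("name"), with_name.group("addr"))
print(bare.group("bare"))
```

Each sequence element sits on its own line and the alternation is visually offset, which is roughly the layout the pseudo-syntax above asks for.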
> Personally I prefer though the terseness of regular expressions.
I think there are legitimate use-cases for both.
If you're quickly hacking out a small good-enough parser for something regular or "almost regular", terseness can be great.
However, if you're parsing a large regular language, terseness isn't really a benefit. Perl is on its deathbed for a reason, and overly terse regular expressions should die for a similar reason. Overly-clever write-only coding culture sucks.
But the terseness of regular expressions is basically terrible beyond maybe a few hundred characters. E.g., the following has no place in production code -- you might as well just include a binary:
One thing I still like about Perl is how well the regex syntax is integrated into the language. Combined with Perl being installed on just about every modern Linux system, that makes it really accessible from your fingertips.
If you're in a hurry and just need to do some one-off slicing and dicing, there's not much more powerful than
being able to say, eg:
import re
data = "foobarfoo"
data = re.sub('bar', 'baz', data)
print(data)
For some reason, the Perl =~ notation sticks in my head
much more easily than the Python method.
Of course you can string a bunch of sed commands together in a shell script as well, but I find that becomes unwieldy pretty quickly in a lot of situations.
In other languages a part of me always feels like if I'm using regex I'm doing something wrong, because I used it so liberally in Perl. It feels like a Perl-ism so I try to make a conscious effort not to carry those over as they tend to be the source of inefficiencies.
TBF for this specific case it's very much unnecessary to break out regexes: most languages do have some sort of literal replacement string API:
str.replace = replace(...)
S.replace(old, new[, count]) -> str
Return a copy of S with all occurrences of substring
old replaced by new. If the optional argument count is
given, only the first count occurrences are replaced.
(though the documentation turns out more misleading than re.sub's: this states it replaces all occurrences but really replaces all non-overlapping occurrences, which re.sub actually spells out properly).
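The non-overlapping behaviour is easy to see with a tiny Python example (the input string is made up):

```python
import re

# "aaa" contains "aa" at index 0 and at index 1, but the two occurrences
# overlap, so only the first one can be replaced.
replaced = "aaa".replace("aa", "b")
subbed = re.sub("aa", "b", "aaa")

print(replaced)
print(subbed)
```

Both calls scan left to right and resume after the end of each match, so the trailing "a" is left alone.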
Oddly enough I was asked once to determine if there were any differences between Perl Regex and Python's implementation in an effort to convert a tool I wrote into Python (so other developers could help)
I found only one very weird edge case worth noting. In Perl, the regex "s/ a* / x /g" (spaces added to prevent formatting) turns the string "bac" into "xbxxcx": because "a*" means zero or more, it also matches the empty strings between the characters. Not so in Python, which produced "xbxcx": it matched the "a" once and did not count the empty match right after it. That is a slightly less accurate result, since * does mean zero or more and the gap between characters counts as zero characters.
That was changed recently, I get 'xbxcx' on 3.5 and 'xbxxcx' on 3.7
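For anyone who wants to check their own interpreter, a minimal Python sketch of the same substitution:

```python
import re
import sys

result = re.sub("a*", "x", "bac")

# On Python 3.7 and later an empty match directly after a non-empty match
# is also replaced, which matches the Perl output; older versions skip it.
print(sys.version_info[:2], result)
```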
And coming to Perl vs Python regex differences, there are too many to count. Python's 3rd party module 'regex' is more comparable to Perl regex. For example, Python's built-in re module doesn't support subexpression calls, \K, \G and so on, and possessive quantifiers only arrived in Python 3.11.
A question of taste, disposition and habituation, supposedly.
I far prefer the Python way for clarity and immediate intelligibility, no magic spells required.
I much prefer the Perl way for writing short disposable stuff and much the Python way if any other person (including me in more than 1-2 weeks) is going to use/modify/read the code.
The gap between perl and python seems cleaner than the gap between python and anything-statically-typed, at least.
I want to call out https://www.regular-expressions.info/ since it was, for a long time, one of the better/best resources on regexes that I was able to find. I learned a lot from this ... guy, essentially.
But I always try to add a warning when recommending them: they should be used only for the flavors they support (PCRE, Python, etc.). I've seen many people use them for CLI tools and then wonder why things like non-greedy quantifiers or lookarounds don't work.
Yes, two issues are a) knowing what to escape and b) what is supported or not. I was weirdly surprised to discover that there are some useful functionalities that are basically supported nowhere... except in vim. I think it was variable-length lookbehinds but don't quote me on this.
Lookbehind is restricted in many regex engines that support it (many don't at all). Some require the lookbehind to be constant-length (so if you have alternatives in there, they all have to have the same length, and you can't use quantifiers, basically). Some require it to be finite-but-known-ahead-of-time-length, so something like (?<=a{3,6}) is okay, but (?<=a{3,}) is not. Also (?<=a|bb) would be okay.
.NET's System.Text.RegularExpressions.Regex is one implementation that has no restrictions on lookbehind. Having used PowerShell for so long it now happens sometimes that I forget when writing regexes for other implementations, as it's really convenient at times.
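Python's built-in re module is an example of the constant-length kind. A small sketch that checks which lookbehinds it accepts (the compiles helper is made up for this example):

```python
import re

def compiles(pattern):
    """Return True if the built-in re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

fixed = compiles(r"(?<=abc)d")          # constant length
alternatives = compiles(r"(?<=a|bb)c")  # alternatives of different lengths
quantified = compiles(r"(?<=a{3,6})b")  # bounded but variable length

print(fixed, alternatives, quantified)
```

Only the first pattern compiles; the other two are rejected because the lookbehind does not have a fixed width.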
yeah, typically you would get an error if you use variable-length lookbehinds
a few do support it (for ex: Python's 3rd party regex module) and sometimes you could work around it with \K (similar to \zs in vim) [1]
And there are other frustrating differences between implementations, for example \g definition is very different between Ruby and Python, character set operations are not found everywhere, etc. Plus, BRE/ERE versions found in command line tools do not even support features like non-greedy and lookaround
My knowledge of regex is basic. Last year, I ended up writing a piece of code which read an inbound email, parsed some of the data and, depending on the type of the email and the data stored, created a lead record on the system with the captured data. I wrote this in Apex, which is a proprietary language of Salesforce and is based on Java. Apex should adhere to the Java implementation of regex. Some things are still different. The website for regex calculations was showing one thing, the Java docs another, and Salesforce something else altogether... It took me a while.
Nice resource. You do start off making it a bit daunting:
>For most people without a formal CS education, regular expressions (regex) can come off as something that only the most hardcore unix programmers would dare to touch.
my experience wasn't as daunting as this at all.
this is my story: basically every regex I ever wrote always worked the first time and I found it super easy. my intro was the Perl 5 "camel" book, i.e. Programming Perl. never heard of regexes before that.
If you find this current tutorial we're discussing tricky or daunting, maybe give the resource I just mentioned ("Programming Perl" for Perl 5) a go because for whatever reason the explanation was super simple and writing a regex was one of the easiest things I ever did no matter what I wanted to do. I didn't know about the concept itself before I read that book. I've used it to parse loads and loads of things, use it in my editors, etc. It's always so easy.
you can maybe ignore the stuff about Perl and still sort of follow the tutorial.
just an alternative in case someone had their curiosity piqued by this illustrated guide, but it doesn't quite "click" for them, and wanted to see the same thing written out in another teaching way. And a note that for me it was always not daunting at all!
I also learned regex from that book and had the same experience as you, so although I can't remember how that book worked its magic (it's been nearly 20 years) I can corroborate your recommendation.
I also found the various perldoc pages on regex to be very helpful resource that I referred to frequently back when I was writing perl. Notable pages are perlretut [0], which serves as an intro to regex, perlre [1] which goes into considerable depth, and perlreref [2] which is a handy quick reference for day-to-day work.
Regular expressions may appear intimidating to the complete newbie, but once you grasp a few simple ideas I'd argue it is one of the easier topics in CS.
And incredibly useful. I don't know if that's just me and the things I do with a computer, but I feel like it's a knowledge I've used a million times in my life, both for my pet projects and my job as a law school TA. What baffles me is that it's a niche knowledge. Excel is insanely useful, but it's considered a basic knowledge, or at least everyone has heard of it. Not regex.
Agreed. I teach a session on it every year to Economics PhDs and faculty — if I can only get them to come. It's something they just don't know they don't know.
Shameless plug: I've written books [1] which focus entirely on the specific flavor supported by the tool or the programming language. I use plenty of examples to present the features one by one. These can be used for learning purposes as well as a handy reference guide.
I think of regular expressions as a mini-programming language, and use loose fitting analogies
Anchors --> adding if conditions
Alternation --> Conditional OR
a(b|c)d = abd|acd --> a(b+c)d = abd+acd
Quantifiers --> string repetition and range operators
Dot metacharacter + quantifiers + alternation --> Conditional AND
Capture group and backreference --> variable
Subexpression call --> function
Character class --> sets
Lookarounds --> custom conditionals
Flags --> like command line options
So far, I've written books for Ruby and Python regular expressions, and BRE/ERE/PCRE/Rust for "GNU grep and ripgrep". Currently writing a book for GNU sed.
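As one illustration of the "capture group and backreference --> variable" analogy from the list above, here is a short Python sketch. The doubled-word pattern and the sample sentence are made up; the group stores a word and \1 reads it back, much like assigning and then using a variable.

```python
import re

# (\w+) "assigns" a word to group 1; \1 then requires the same text again.
doubled = re.compile(r"\b(\w+)\s+\1\b")

m = doubled.search("this is is a test")
print(m.group(0), "->", m.group(1))
```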
I consider myself to be fairly well versed in regex having used it for years, but a recent task to match only stored procedure definitions including the optional block comments immediately preceding them (if any) in large SQL scripts proved completely exasperating. I was deep in the weeds trying to get it to do optional positive look-behind assertions before I just gave up and wrote some code to do it.
The gory details of how to just match block comments are best outlined by this blog post detailing one kindred spirit's journey through the wilderness: https://blog.ostermiller.org/find-comment, although this wasn't particularly helpful to my situation, it was reassuring to know that I was not alone and might prove interesting to the crowd here.
Regex named capture groups are somewhat of an underused feature I find, especially with things like simple parsers they help to keep stuff readable https://regular-expressions.mobi/named.html
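A minimal Python sketch of named groups in a simple parser. The log line format here is hypothetical, chosen only to show the syntax:

```python
import re

# Each field gets a name instead of a position.
LOG_LINE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}) (?P<level>[A-Z]+) (?P<message>.*)"
)

m = LOG_LINE.match("2019-11-03 ERROR disk full")
fields = m.groupdict()
print(fields["date"], fields["level"], fields["message"])
```

Reading fields["level"] is a lot clearer six months later than reading m.group(2).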
Has anybody seen an introduction to regexes that starts with finite automata?
I was introduced to them that way and it seemed very intuitive to me. I’m not sure if I just have a head for regular expressions, or if the way they were presented was especially good.
NC State's second semester CS course uses a regex model as an example of Finite State Machines; however, I think it's only an example, not a deep dive into it
I've taught a few "intro to regex" classes and have settled on a very simple explanation that seems to work even with fairly non-technical audiences. In short:
1) Regular expressions and "regex" are different things. Regex is a superset of regular expressions. But if you can understand regular expressions you can understand a huge number of regexes and have the tools to understand the rest of the syntax that the system uses.
2) There are many many many regular expression and regex dialects and engines. It's useful to learn the dialect you are trying to use.
3) Originally, regular expressions were used to describe a generator of strings. It can be helpful to approach writing regexes from this standpoint. It's not about what it matches, but what strings the regex can produce. I've had many students go from bewildered incomprehension to immediate eureka when I describe this.
4) There are only three rules that the beginning regex user has to know:
i) Concatenation - you can make a bigger generator by combining one generator after the other. For example: the regex 'a' and the regex 'b' can be concatenated into 'ab' and produces only the string 'ab'.
ii) Alternation - you can specify one character or another using the pipe character in most dialects '|'. For example the regex 'a|b' produces either the string 'a' or 'b'.
iii) Repetition - the character '*' is a postfix operator that means that the character that precedes it can be repeated zero to an infinite number of times. For example: 'a*' produces the empty string, 'a', 'aa', 'aaa' and so on.
5) By combining these (i.e. using concatenation) and the grouping operators '(' ')' you can write a regex for almost anything.
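The "generator of strings" view from points 3 to 5 can be approximated in Python by brute force: try every string over a small alphabet and keep the ones the regex accepts in full. The generated helper is invented for this sketch, and it only explores strings up to a given length, so it is a teaching aid rather than a real generator.

```python
import re
from itertools import product

def generated(pattern, alphabet, max_len):
    """List every string over alphabet, up to max_len long, that pattern produces."""
    compiled = re.compile(pattern)
    return [
        "".join(chars)
        for n in range(max_len + 1)
        for chars in product(alphabet, repeat=n)
        if compiled.fullmatch("".join(chars))
    ]

print(generated("ab", "ab", 3))       # concatenation
print(generated("a|b", "ab", 3))      # alternation
print(generated("a*", "ab", 3))       # repetition
print(generated("(a|b)*a", "ab", 2))  # all three combined with grouping
```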
6) Nearly all of the other characters you see in regexes that don't make sense are simply syntactic sugar that make it easier and shorter to write common expressions in a more compact form. Examples:
Writing lots of alternations of single characters (called character classes)
- 'a|b|c|d|e|f' can be written as [a-f]
- an expression that is the alternation of every character except the newline character can be written as '.'
There's a whole host of these kinds of things.
Writing lots of different kinds of repetitions introduces lots of new operators
- 'aa*' can be written as 'a+' where '+' means 'repeat the previous character 1 or more times'
- 'a?' means 'repeat the previous character 0 or one times'
- 'a{1,5}' means 'repeat the previous character 1 to 5 times'
There's also an entire table of these things.
After that it's mostly practice and referring to the description of the operators for the dialect you are using.
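The "syntactic sugar" claim can be spot-checked in Python by comparing each shorthand with its long form on a handful of sample strings. This is a sanity check on samples, not a proof of equivalence, and the same_matches helper is made up for the example.

```python
import re

def same_matches(short, long_form, samples):
    """True if both patterns accept exactly the same sample strings in full."""
    return all(
        bool(re.fullmatch(short, s)) == bool(re.fullmatch(long_form, s))
        for s in samples
    )

samples = ["", "a", "aa", "aaaaa", "aaaaaa", "b", "f", "g", "ab"]

checks = {
    "[a-f]": same_matches("[a-f]", "a|b|c|d|e|f", samples),
    "a+": same_matches("a+", "aa*", samples),
    "a?": same_matches("a?", "(|a)", samples),
    "a{1,5}": same_matches("a{1,5}", "a|aa|aaa|aaaa|aaaaa", samples),
}
print(checks)
```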
I usually cover capture groupings (e.g. \1, \2, \3 or $1, $2, $3) later on if necessary as it requires people to have a good handle on the particulars of their language's way of handling those things. At this time I'll also cover the '(?:.)' non-capture group operator if their dialect allows.
About as frequently I'll have to introduce '^' and '$' as begin and end string operators and also deconflict it from '[^.]'.
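A short Python sketch of the two meanings of '^' (the sample strings are made up): outside brackets it anchors to the start of the string, while as the first character inside brackets it negates the class. Inside brackets the dot is also just a literal dot.

```python
import re

# Anchors: ^ pins the match to the start, $ to the end.
first_number = re.findall(r"^\d+", "42 apples and 7 pears")
last_number = re.findall(r"\d+$", "7 pears in room 101")

# Negated class: [^.] means "any character except a literal dot".
labels = re.findall(r"[^.]+", "www.example.com")

print(first_number, last_number, labels)
```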
All the crazy-ass back/forward reference stuff I usually leave out as people almost never need or come across those except in pretty rare instances. In those cases they usually have enough regex under their belt that it's just another topic to pick up.
This doesn't get anybody to pro-wizard level and there's always edge cases and weird Perl-golf stuff that's out there, but it'll get most people to generally functional inside of a day or two and I'll just make myself available to spot answer questions or remind them where they forgot something.
The pattern is kinda nonsense.
The first part [0-9] matches one digit in the given range 0-9.
The second part is obviously the same, but the * quantifier means the set can occur 0 or many times.
The whole pattern therefore means one digit 0-9 followed by zero or many digits in the range 0-9.
so
"3" would match
"21" would match
"123456" would match
"" would not match
The obviously "correct" pattern for this behavior would be [0-9]+
The + means 1 or more and thus makes repeating the set obsolete.
Author here. I actually did not know this. I used this very old book (Unix Power Tools) and after finishing the chapter on regex I didn't recall seeing a `+` operator (maybe I forgot to add it or skipped over it). But indeed the `+` operator seems to make significantly more sense here.
No problem =) I was just a bit confused myself when I saw that pattern, and it didn't instantly make sense why someone would write that. If you use regex as often as I do, you kinda instantly see that something is weird if a pattern has a repeating sub-pattern. It basically should never happen if the pattern is simplified.
You may wanna have a look at the PCRE flavor. If you understand this one you can work with all the others (you'll only miss some features).
The + is really just a shortcut for {1,}, which means 1 or more, greedy.
The * is a shortcut for {0,}, which means 0 or more, greedy.
Also, while I'm at it, the "even more correct" pattern would of course be \d+ (at least in PCRE). \d stands for digit; it is a shortcut for [0-9].
Personally I would never write [0-9] I only ever use sets for partial ranges i.e. [5-9] or [a-f]
Whenever you need the "full" set there is usually a short way to write it. This makes pattern shorter and easier to read.
Another nice trick is to uppercase the shortcuts to invert their meaning, so \D would match any non-digit.
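A quick Python check of the patterns discussed here, run against the sample strings from earlier in the thread. One Python-specific caveat: for ordinary (str) patterns, \d also accepts non-ASCII decimal digits unless the re.ASCII flag is given, so in Python it is not exactly [0-9].

```python
import re

samples = ["3", "21", "123456", ""]
results = {
    pattern: [bool(re.fullmatch(pattern, s)) for s in samples]
    for pattern in (r"[0-9][0-9]*", r"[0-9]+", r"\d+")
}
print(results)

# \D is the inverse of \d: runs of non-digits.
non_digits = re.findall(r"\D+", "abc123def")

# U+0663 is ARABIC-INDIC DIGIT THREE.
arabic_three = "\u0663"
unicode_digit = bool(re.fullmatch(r"\d", arabic_three))
ascii_only = bool(re.fullmatch(r"\d", arabic_three, re.ASCII))
print(non_digits, unicode_digit, ascii_only)
```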
The "+" only saves you a few keystrokes. Using "*" is also fine and not wrong, but a little more error prone, so I would note to better use "+" in a peer review.
Exactly, it's mostly about readability, which is often (not always) better if the shortest possible pattern is used.
(0|1|2|3|4|5|6|7|8|9)(0|1|2|3|4|5|6|7|8|9){0,} and [0-9][0-9]* and [0-9]+ and \d+ all do the same, so none of them is wrong: the result will be correct.
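One caveat when spelling out alternatives by hand: inside square brackets the pipe is a literal character, so alternation needs parentheses rather than a character class. A minimal Python check:

```python
import re

# In a character class, '|' is just another member of the set.
in_class = bool(re.fullmatch(r"[0|1]", "|"))

# In a group, '|' separates alternatives and is not matched itself.
in_group = bool(re.fullmatch(r"(0|1)", "|"))

print(in_class, in_group)
```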
I seem to have deleted part of the title. The article is titled
>Regex For Noobs (like me!) - an illustrated guide
So I am painfully aware that I'm a noob too, I even call myself one! I wrote this little piece because I felt that regex can be kind of intimidating for a lot of people but that it's actually pretty OK and fun once you get the hang of it. To get people over that initial barrier is what I aim to do with the "guide".
People who are new to a topic and just learned a new thing often do a better job at explaining things without requiring prior knowledge. Experts in a topic often see things as trivial and write introductions that exclude a fraction of potential readers that lack a skill.
I can recommend the articles by Julia Evans[1], which are pretty much like this. She works on a problem that is new to her and describes how she solved it or how she learned something new.
I learnt regex when I started with PHP in the 2000s and subsequently when I learned Perl. When I started studying computer science at the university, I was super baffled that there is a full theory of regular expressions: The one of a regular (formal) language (https://en.wikipedia.org/wiki/Regular_language).
I recommend everybody who's interested in "looking behind the curtains" of regexp, their powers and limitations, to dive in a crash course of formal languages (Chomsky hierarchy). It can be useful to gain a feeling what a regexp can do and what it cannot.
i remember seeing some japanese anime-style programming guides but never thought anything of them back then. i really miss these short illustrated guides but i don't recall what company was publishing them.
I've never memorized any regex. It's so syntax dense without many intuitive characters. Not sure this is a problem. They're basically write only. Please always comment beside regexes what they intend on doing.