The #regex IRC channel had an IRC bot with a quiz with 28 levels.
All sensibility ended after level 14 or so. At that point it was just "how deep does the PCRE rabbit-hole go?"
But there was a lot of useful, non-trivial stuff, too. Most specifically: lookaheads/lookbehinds, non-greedy matching, back-references, named capture groups, character classes, and anchors.
When I learned jq, I went much the same way: Started hanging out on #jq IRC channels and started trying to answer jq questions on StackOverflow. Sadly, I got outperformed the first six months, until it finally clicked.
The resources from Jan Goyvaerts / Just Great Software are great! His guides, and to some extent his tools, are how I learned it too. Today I tend to be the regex go-to guy among colleagues, seemingly all because I properly got the hang of the basics via Jan's resources.
I read Mastering Regular Expressions by Jeffrey Friedl 20 years ago when I was in middle school, front to back, and it's probably been the best investment of reading time I've ever made.
I had read on some tech site, some years ago, that Friedl worked at Yahoo for a while, and IIRC in a role involving a lot of text munging, which would probably have involved a lot of regular expression usage, maybe across the many web properties they had in that period, which included Yahoo Search, Yahoo Mail, Yahoo Groups, Yahoo Finance, and many more.
Found that interesting.
I had bought his book around that time or sometime later, but never read it fully, partly because I used to go cross-eyed from reading the text with all the italics and other highlighting (of the regexes in action) in a small font, which was probably needed to explain regex concepts, but still ...
but after reading this thread, I feel motivated to dive into regex again, at least at the shallow end of the pool, although I have dabbled in it and used it now and then in my work, before now.
I mean, the highlighting was maybe useful, but having it in the small font made it unnecessarily difficult to read. Squiggly italics of, say, one or two characters in length are harder to distinguish from non-italic characters when set in quite a small font.
If I remember correctly, the local public library was selling books off for ultra cheap for some reason, and I added it to the pile of books my parents bought to fill the empty built-in bookshelves in our new family home (along with a bunch of reader's digest anthologies). I scanned through the first part of it and it seemed super powerful (I was just getting into programming at the time), so it captivated me.
I'm really surprised by the low quantity of people who learned by trying and instead read whole books or manuals. Had some code or whatnot that needed mass-replacing and used the built-in RegEx find-and-replace (I think it was EditPlus in those days). First learned how to match the exact string then extrapolate from there using {}, (), replacing, etc. It's a lot easier to learn when you need to solve a practical, immediate problem.
That's how I learned. I'm a self taught dev so I pretty much just took the same approach. Read documentation, try it out, read more docs, try it out, read some examples, search for how other people do it, etc. At a certain point, you just know it and can solve 90% of your needs without looking at docs. Although, tbh, I haven't written a very complicated regex in probably a decade and would need to do some warm up reps if I needed to today.
I agree you don’t need “whole books or manuals”, but how do you learn by trying alone?
The search space is enormous, and even if you stumble on a code fragment that appears to work, how do you know your code actually does what you think it does, and how do you know there isn’t a more efficient or readable way to do what you want to do? Case in point, you wrote:
> then extrapolate from there using {}, (), replacing, etc.
How did you find out about those, if not from reading (likely followed by some trying to check your understanding)? I think you have to read, not “whole books”, but ‘just’ the right documentation, where ‘right’ depends on the tool you use. For example man regex may be sufficient. That, you can read in a few minutes.
Yeah this is wild to me, maybe it’s a generational thing? I never “learned” regex. I’ve written hundreds of them but I figured out what I needed and then I moved on.
"Learned" in University but it wasn't until Jeff Friedl's Perl Conference talks that I really became one with the regex engine. He taught you how to think like the regex engine and thus how regular expressions would be interpreted and thus how to write them. Then I got a master class in RE from Tom Christiansen when we were writing the Perl Cookbook.
Jeff wrote "Mastering Regular Expressions", which grew from that talk. You probably want a copy even though it was first released in 1997. For the mindset of RE, you can't beat it.
Learning REs is a roll through:
* how matching happens (advancing, matching, backtracking)
* using * ? and {} to match repetitions
* greediness and stinginess within the RE
* character classes, both [manual] and escapes like \s \W etc
* anchors and "what a line is"
* grouping and backreferences
* accessing groups outside the RE
* substitution and access backrefs in substitutions
* find ALL the matches
* complex parsing (just don't, it's rare not to regret it)
and then it's an absolutely epic deep-dive into the minutiae of what line starts and ends might be, Unicode and regex, code to be executed from within the regex engine, using code to BUILD regex and worrying about when escaping happens or doesn't, denial of service regex, etc. that will take you through ASCII, various Unix tool chains over time, and a bunch of other fun stuff.
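A few of the items in that list, sketched in Python's `re` module. This is only an illustration under my own assumptions: the sample strings are made up, and other engines spell some of these features differently.

```python
import re

text = "<b>bold</b> and <i>italic</i>"

# Greedy: .* grabs as much as it can, then backtracks to the last '>'.
greedy = re.search(r"<.*>", text).group(0)
# Stingy (lazy): .*? stops at the first '>' that lets the match succeed.
lazy = re.search(r"<.*?>", text).group(0)

# Grouping plus a backreference inside the pattern: \1 must repeat group 1.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine").group(1)

# Accessing groups outside the pattern.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", "released 2014-07-11")
year, month, day = m.groups()

# Substitution with backreferences in the replacement.
swapped = re.sub(r"(\d{4})-(\d{2})-(\d{2})", r"\3/\2/\1", "2014-07-11")

# Find ALL the matches (findall returns the captured group for each hit).
tags = re.findall(r"</?(\w+)>", text)

print(greedy, lazy, doubled, (year, month, day), swapped, tags, sep="\n")
```

Comparing `greedy` and `lazy` on the same input is probably the quickest way to see backtracking at work.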
I need to build a Regex a couple of times a year, and have always wondered whether others learn it and store it in their brain-cache, or whether they too need to look it up each time.
These days, I ask Claude/ChatGPT to create the regex and usually I know enough to be able to verify it. To double check, I'll start a new conversation and ask it what the regex does and verify it that way.
You can also ask it to create unit tests with edge cases. It might not catch every edge case, but usually it will create edge cases that you might not think of when writing unit tests yourself.
Learned regex in the 90's from the Perl documentation, or possibly one of the O'Reilly Perl references. That was a time when printed language references were more convenient than searching the internet. Perl still includes a shell component for accessing its documentation, which was invaluable in those ancient times. Perl's regex documentation is rather fantastic.
A simple way to test a regex you're building is this website, which offers immediate parsing and documentation of your regex, lets you test it against various inputs, and lets you choose which language's regex parser you are targeting.
Practice, the more you use them the easier they become. I never studied them but knew when to use them, then just tinkered and iterated until the pattern did what I needed it to. After a while you can mostly just write and read them without much tinkering.
So, you can observe what kind of state machine is produced from any given regular expression. You can also use it to merge and otherwise manipulate state machines, or to simplify regular expressions.
Easy, I learn it every time I unfortunately need to use it, through painful trial and error, and searching. Thankfully there are online evaluators now, but now you need to figure out which regex flavor is being used.
Then I forget it, and have unreadable mystery functions laying around that I hope don’t have bugs.
But at least it’s a single line!
Seriously though, my actual need for them is low, so I avoid the things as much as I would avoid inlining assembly.
That is a hard question because there are so many ways that one can understand regex. I learned how to read and use them using Unix tools like sed, but I think that my path to starting to understand them probably began with papers like "Regular Expression Matching Can Be Simple And Fast" by Russ Cox (https://swtch.com/~rsc/regexp/regexp1.html), well after feeling like I was pretty good at using them.
Then, as an expert in linguistic morphology, I started learning about things like subregular languages, as talked about in works such as Aural Pattern Recognition Experiments and the Subregular Hierarchy, by Rogers and Pullum (https://www.cs.earlham.edu/~jrogers/JoLLI.pdf). And I continue to wonder what the relationship is between these classes of languages and word formation.
Piece by piece, googling "how to do X in regex". But that was slow and didn't have a great foundation.
Then I learned Perl and started learning RegEx properly. Now somehow I've turned into one of those wizards I admired in the Stack overflow answers section. It wasn't until I had to teach RegEx to a junior that I realized how far I'd come.
One of the things I remember being difficult at the beginning was the subtle differences between implementations, like `^` meaning "beginning of line" in Ruby (and others) but meaning "beginning of string" in JavaScript (and others).
If you're just starting out, it'd be helpful to read about how a regex engine evaluates an expression against a string so that you can understand the "order of operations" and how repeating elements are matched.
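The `^` difference mentioned above can be reproduced inside a single engine. A small sketch in Python, whose default is "beginning of string" and whose `re.MULTILINE` flag switches `^` to "beginning of line"; the sample text is made up.

```python
import re

text = "first line\nsecond line"

# Default mode: ^ matches only at the very start of the string.
default_hits = re.findall(r"^\w+", text)

# re.MULTILINE: ^ also matches right after each newline.
multiline_hits = re.findall(r"^\w+", text, flags=re.MULTILINE)

# \A always means "start of string", whatever flags are set.
start_only = re.findall(r"\A\w+", text, flags=re.MULTILINE)

print(default_hits, multiline_hits, start_only)
```

When a pattern has to move between engines, spelling out the intent with `\A` or an explicit flag is usually safer than relying on what `^` defaults to.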
For me the biggest hurdle was learning what they were 'for' and that took a long time. The real magic for me was capture groups - I could now suddenly see why you'd have a regex and not just string matching.
Then it was about knowing a situation or a problem when regexes would apply and knowing how to look up the things I needed to solve that problem. Some regex 'phrases' are good for grepping, others for find and replace. Some will help you swap names around, some to reformat phone numbers.
After a while the phrases give way to general understanding and certain things become fluent.
I still only really write short or basic regexes, but I use them all the time in editing text or doing things that are a little bit complicated but actually a short regex just turns it from a hard problem into an easy problem.
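Two of the "phrases" mentioned above, swapping names around and reformatting phone numbers, sketched with capture groups in Python. The inputs and the target phone format are my own assumptions, purely for illustration.

```python
import re

# Swap "Last, First" into "First Last" using two capture groups.
swapped = re.sub(r"^(\w+),\s*(\w+)$", r"\2 \1", "Lovelace, Ada")

# Reformat a 10-digit number, ignoring whatever separators were used.
phone = re.sub(
    r"\D*(\d{3})\D*(\d{3})\D*(\d{4})\D*",
    r"(\1) \2-\3",
    "555.867.5309",
)

print(swapped)
print(phone)
```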
Start with https://regexone.com/ fun puzzle style interactive tutorial to grasp the basics.
After that it's the matter of either using it with your CLI tools or applying it to problems you are working on.
My first jobs were heavily focused on parsing data from HTML and regex was (and still is) the most common solution for the majority of cases
To learn it, I played a lot of regex golf [1]
I also enabled regular expressions in my code editor's Find feature so every search I'd make used regex. Having it enabled in my editor made learning it more immersive and useful, especially when combined with things like find-and-replace. I highly recommend permanently enabling that in your editor as well
Also, challenge your coworkers to see who can make the shortest patterns for a variety of cases and see whose is the most versatile. It's always a fun time
That's my go-to these days, but sometimes I like to see a diagram from this one: https://regexper.com
I've just slowly learnt it by experimenting with it over the past few years. People have mostly mentioned matching, but I use it more for string manipulation.
I'm still not as intermediate a programmer as I'd like to be, so it's great when I need to invert a design decision for example. A similar code structure in multiple places, maybe across multiple files. It also means I don't miss anything, like I would if I did it manually.
Regex the "env specific variant" or regex the concept (as it applies to theory of computation) etc?
The former I can never remember beyond the basics (*, +, ?, |). Even with | I go extra cautious and put in tons of parentheses. If I ever need matching and grouping I resort to rtfm.
Now the latter, that's the more interesting and fun one!! Learnt it in college decades ago but really drilled it in by reimplementing Russ Cox's amazing Thompson NFA blog and breakdown in TypeScript!
Emacs has two packages called rx and xr. They allow you to write regexes in an sexp syntax and translate between that and conventional regex syntax. Furthermore you can define regex snippets and compose them into new ones. This more than anything can give you a handle on how a regex is composed.
For example
"\\`\\(?:[^^]\\|\\^\\(?: \\*\\|\\[\\)\\)"
can be written as
(: bos
   (| (not (in "^"))
      (: "^"
         (| " *" "["))))
Emacs also has other features to highlight matches and groups to help understand regexes better.
I learned regex by writing an online poker hand history converter. I used to refer to it as spaghetti php code, but I've come to realise it was just newbie functional php wrapping a stack of regex.
Mid way through my 20 year career I realised that every job I'd had really came down to parsing data and outputting something a company finds value in. It's regex all the way down.
Develop useful tools that need complex parsing, I gathered data from websites using regex to make automated tools, for example getting flat offers from my city and making a tool to notify me.
Simply try to parse some complex information like movie strings. As an exercise, you can try to parse movie release names to produce a result like this.
```
{
  "name": "Dawn Of The Planet of The Apes",
  "year": "2014",
  "resolution": "1080p",
  "codec": "h264",
  "source": "web-dl",
  "audio": "AAC5.1",
  "group": "RARBG"
}
```
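One possible take on that exercise, sketched in Python. The input filename below is hypothetical (the comment does not give one), and the field order in the pattern is an assumption about how such names are laid out. Real release names vary a lot, so treat this as a starting point only.

```python
import re

# Verbose mode lets each field carry its own comment.
RELEASE = re.compile(
    r"""
    ^(?P<name>.+?)                   # title words, lazily, up to the year
    \.(?P<year>(?:19|20)\d{2})       # four-digit year
    \.(?P<resolution>\d{3,4}p)       # e.g. 720p, 1080p
    \.(?P<source>[A-Za-z-]+)         # e.g. WEB-DL
    \.(?P<audio>[A-Za-z0-9]+\.\d)    # e.g. AAC5.1
    \.(?P<codec>[A-Za-z0-9]+)        # e.g. H264
    -(?P<group>\w+)$                 # release group after the last dash
    """,
    re.VERBOSE,
)


def parse_release(filename):
    """Return a dict of fields, or None if the name does not fit the pattern."""
    m = RELEASE.match(filename)
    if m is None:
        return None
    info = m.groupdict()
    info["name"] = info["name"].replace(".", " ")
    info["codec"] = info["codec"].lower()
    info["source"] = info["source"].lower()
    return info


# Hypothetical input, chosen to line up with the JSON above.
sample = "Dawn.Of.The.Planet.of.The.Apes.2014.1080p.WEB-DL.AAC5.1.H264-RARBG"
print(parse_release(sample))
```

Named groups keep the result readable, and returning `None` on a mismatch makes it obvious which names still need handling.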
I spent half a day playing with https://regexone.com and got the fundamentals in place, after that it’s been practice in solving tasks at work. Rubular is awesome if you’re looking to test out a pattern
I learned regex incidentally from reading the classic book "Software Tools" (Kernighan & Plauger). It has a chapter which briefly describes the syntax but focuses on an analysis of the code used to parse & process them (in RATFOR).
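To give a feel for how small a matcher can be, here is a minimal backtracking sketch in Python. It is not the book's RATFOR code. It is my own illustration and supports only literals, `.`, `*`, `^` and `$`; the function names are made up.

```python
def match(pattern: str, text: str) -> bool:
    """Return True if pattern matches anywhere in text."""
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)
    # Unanchored: try every starting position, including the empty suffix.
    return any(match_here(pattern, text[i:]) for i in range(len(text) + 1))


def match_here(pattern: str, text: str) -> bool:
    """Return True if pattern matches at the beginning of text."""
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if pattern == "$":
        return text == ""
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False


def match_star(char: str, pattern: str, text: str) -> bool:
    """Match zero or more of char, followed by the rest of the pattern."""
    i = 0
    while True:
        # Try the rest of the pattern after consuming i copies of char.
        if match_here(pattern, text[i:]):
            return True
        if i < len(text) and char in (".", text[i]):
            i += 1
        else:
            return False


print(match("^ab*c$", "abbbc"), match("b.d", "abcd"), match("^abc$", "abcd"))
```

Tracing a pattern through `match_star` by hand shows where the "try, fail, back up" behaviour of a backtracking engine comes from.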
I read through Learning Perl and practiced different regex patterns with simple programs. Programming Perl dove deeper, as did the O'Reilly Mastering Regular Expressions book.
That and practice. I frequently check them with online regex tools to make sure the regex does what I want before I implement them.
I was a Perl 5 magi and carried the O'Reilly book in hand for a few months. The impression has meant that every regex in 20 years was effortless, though Perl long forgotten.
Perl 5 regex familiarity seems like it futureproofed.
Now I suppose I mostly use JS or Vim which is such a subset.
Writing textmate grammars for VSCode extensions. I took over as the lead maintainer of the Godot engine VSCode extension, and one of the first things I worked on was adding syntax highlighting support for Godot 4's improvements to GDScript.
Textmate grammars are basically hundreds of nested regex snippets that recursively apply tags to regions of text. This is made worse by the fact that the grammar is written in JSON, so any escapes need to be double escaped, which means you can't easily copy your regex into something like regex101. You sort of just have to suffer until you get good at it.
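The double escaping can at least be automated in both directions. A small Python sketch, assuming a made-up pattern and a simplified `{"match": ...}` entry rather than a real grammar file:

```python
import json
import re

# The regex as you would paste it into a tester.
pattern = r"\b(func|var)\s+(\w+)"

# json.dumps doubles every backslash, which is the form a JSON file needs.
as_json = json.dumps({"match": pattern})

# json.loads undoes it, recovering the single-escaped pattern for testing.
round_tripped = json.loads(as_json)["match"]

print(as_json)
print(round_tripped)
print(re.search(round_tripped, "func ready").group(2))
```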
I taught myself through trial and error by using TextSoap (https://textsoap.com/mac/) to make my legal drafting easier.
TextSoap has been around forever and must be the most underrated app on the Mac. It’s amazing — I rank it alongside Keyboard Maestro, if that tells you anything. It’s also available on SetApp. I can’t say enough good things about it.
If you get into it, there is an Alfred workflow that lets you search for and apply cleaners to selected text.
I learned by having to parse fields from log messages, in order to ingest log sources that aren't supported by the $SIEM at $job. Having said that, I typically learn regex, then forget regex, then learn regex and so on....
My first job involved a lot of web scraping and general data munging in PHP, so we accomplished it with a combination of XPath and regexes. Mostly XPath, with regex getting us through any particularly thorny bits of data we couldn't easily XPath out.
I had also done a tiny amount of regex in a college programming course, but really I didn't get "good" at them until I used them on the job.
I learned it in my "Programming Languages" class in university.
And then about 6 months later I had completely forgotten it.
It's one of those things you need to use regularly to keep it in memory. At least that's the case for me.
I tend to shy away from it these days for a lot of cases (ever try to regex validate an email??) but when I do use it, it's honestly just a process of re-learning for about 15 minutes each time.
If you have no experience, go through a tutorial to get the general idea.
Learn the rest "on demand" whenever you need it, it's not something to spend a lot of practice time on. Because if you don't use it a lot, you'll forget most of what you learned anyway, and if you do use it a lot, then you don't need to spend dedicated learning time, you'll get good quite quickly.
"Don't learn it" is generally bad advice for anyone seeking knowledge. Your suggestion will also lead to missing important edge cases where matches may incorrectly hit or miss on unexpected inputs. The basics of regex are really not that difficult to grasp anyways
Nevertheless, regex is something that there's little benefit in developing deep knowledge of - it tends to be used to solve specific pattern matching problems and there are great tools that will give you an answer to your problem.
Came here to say this. I always knew regex was something I was going to have to learn, and procrastinated for a long time because I dreaded it. I kept kicking the can down the road until one day ChatGPT came along and solved that problem permanently.
I personally don't work much with regexes, so usually, when I need one, I ask ChatGPT to make one, and it comes back with something good.
FWIW: I've seen some regexes used as a source of truth: the regex processed the data, and working out what settings a user had created required parsing the regexes. I quickly refactored that code to store the setting itself.
The most common uses in JavaScript are in the RegExp test method and the String replace method. The replace method is cool because it can receive a function as the second argument and the argument of that function is the value matched by the RegExp that can be modified and returned.
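The same two idioms exist outside JavaScript. A sketch of the Python equivalents (Python is my choice here, not the commenter's): `re.search` plays the role of `test`, and `re.sub` accepts a function, which receives a match object rather than the matched string.

```python
import re


def double(match: re.Match) -> str:
    """Replacement callback: double whatever number was matched."""
    return str(int(match.group(0)) * 2)


# Rough counterpart of String.replace with a function argument.
result = re.sub(r"\d+", double, "3 apples and 12 pears")

# Rough counterpart of RegExp.test: does the pattern occur at all?
has_digit = re.search(r"\d", "abc1") is not None

print(result, has_digit)
```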
By trying to extract a C (pre 20) header and its comments by hand to generate some AsciiDoc documentation from it. Ended up refining the comment "format" that was already in use, then parsing it with a Python script and various regexes to generate coherent documents.
Having a series of data extraction/alteration tasks that regex made really easy. Regexr.com is a great playground for figuring it out, but having to use it in a practical way in my day to day for about a year cemented that skill.
We took the academic approach in college: Kleene closures, Chomsky grammars, etc. Once the meaning behind the line noise of symbols was clear, it became fairly easy to write them.
Using vim to edit large dictionary files.
I forked a steno dictionary for Plover, and edited its 300k entries in vim. I've been using regex daily since then.
Mainly in a job that forced me to use them (data mining). There are interactive tools that help you visualize what is going on, they are immensely helpful.
For context, AFAIK, I maintain the largest Textmate (read as "regex document") grammar in the world: the C++ Textmate grammar for VS Code. (Don't mistake that for bragging, it's a literally-unfixable dumpster fire.) It has pretty much forced me to regularly use every regex feature, from recursive named backreferences to atomics and the time complexity of lookahead's combinatorial explosions.
https://regexr.com/ is one of the most amazing interactive resources, I can't recommend it enough. Back in the day I used it to go from beginner to intermediate. And while I never used this next site to learn, https://regex-vis.com/ is a great place to check out. From intermediate to master I've pretty much relied on rexegg.com/ for discovering the advanced stuff and engine differences. After that https://regex101.com/ was helpful for performance analysis. I first learned regex just mucking around in the CLI with some guidance from a programmer friend. Pure trial and error learning.
While I am inclined to say "the only way to learn regex is to use it", after reading the comments I must agree it would've been nice to have examples of pitfalls and misconceptions. There are a lot of them that can take a very, very long time to learn without direct examples. I'd never even heard of Jeff Friedl (not till this post at least). So props to people who can actually read those kinds of books.
I'm also inclined to say just use it, based on the Pareto principle. regexr.com is the best, and has a cheatsheet section, plus hover tooltips explaining expression behavior. Our team has an informal policy of “link to regexr demo” in comments at callsites for anything absolutely nontrivial.
First day on the job as tech support at a local dialup ISP/CLEC. My manager gave me a regex cheatsheet, because we had a gigantic multi-file spam filter that we would all need to troubleshoot, and some of us were allowed to edit. It was fun.
Cheat sheets are the way to go though, especially because of the different versions. If you do enough with them, one day the main stuff will just stick. Once you're fairly productive you will realize you missed a feature or trick that is particularly useful for what you've been doing, and after getting mad at yourself for missing it you will add that into your repertoire. Repeat.
Also, don't be afraid to just split/cut and do it in your language of choice instead of regex. Most of the time it doesn't make too much of a difference performance wise. Many times it can be faster and/or more readable. The best approach is often a combination. Nobody likes the wizard that tries to put everything into one regex to rule them all.
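A small comparison of the two approaches on the same input, sketched in Python. The `key=value;...` line is a made-up example, and which version reads better is a matter of taste.

```python
import re

line = "host=example.org;port=8080;tls=true"

# Plain string methods: split on ';', then on the first '=' of each pair.
by_split = dict(pair.split("=", 1) for pair in line.split(";"))

# The same extraction with a single regex.
by_regex = dict(re.findall(r"(\w+)=([^;]*)", line))

print(by_split)
print(by_regex)
```

For input this regular, the split version states the structure directly. The regex version starts to pay off when separators or whitespace get inconsistent.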
Regarding versions, I learned with PCRE, have mostly worked with python, and have hit problems using other various implementations over the years. Though it's never enough of a problem that I can remember what those differences are, I just look it up and move on. Unless it's going to be an ongoing project, in which case I print out a new cheatsheet and hang it up.
Muddled through for a decade with a vague sense that nothing worked as I expected it to. Gradually realised there are many variations with different behaviour, obfuscated by dreadful documentation and emergent behaviour in the implementations.
Then wrote a regex engine. It's now extremely obvious how regular expressions work as they're very simple. The spurious divergence in syntax and semantics is still infuriating but at least I know what they're supposed to be desugaring to now. Recommended as a worthwhile exercise.
Regular expressions have "generate a letter of the alphabet" as a primitive. It might be ascii and use 'b' to generate that letter, or a hex escape like \x42 or similar. The notation varies a bit. Another primitive is "generate the empty string".
Then there are compound operations. One regex or another, one followed by another. Intersection, complement. All the set operations, for the reason that a regular expression is literally a notation for a set of strings.
Some things like "lookahead" are notation for intersection. The match previous construct, \2 or similar, takes you out of regular expressions but works like checking equality on the fly.
Finally anchors, $ or ^ etc, are specific to the match problem. It's still finding an element of the denoted set, but with some extra constraints on where the element can occur.
I'm pretty sure that's all of it. How anchors interact with the set description is a nuisance but seems well formed - I haven't bothered to work through that part yet because I'm mostly interested in string generation, not matching.
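The "lookahead is intersection" and "backreference is an equality check" points can be seen in a few lines of Python. The patterns and inputs here are my own examples, not taken from the engine described above.

```python
import re

# Each lookahead is a separate constraint checked from the same position,
# so the pattern accepts only strings in the intersection of three sets:
# "contains a digit", "contains a lowercase letter", "at least 8 chars".
digit_and_lower = re.compile(r"^(?=.*\d)(?=.*[a-z]).{8,}$")

# Backreference: \1 must equal whatever group 1 captured, which takes the
# pattern outside what a plain regular expression can describe.
repeated_half = re.compile(r"^(\w+)\1$")

for candidate in ["passw0rd", "password", "12345678"]:
    print(candidate, bool(digit_and_lower.match(candidate)))

for candidate in ["abcabc", "abcabd"]:
    print(candidate, bool(repeated_half.match(candidate)))
```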
The rules aren’t that hard but actually applying it to code and honing it to consistently pull exactly what you want is in my experience the hardest part.
When I was young I used a chat program developed by occasional HN commenter @krazydad. It had a client-side scripting language called IPTSCRAE (https://en.wikipedia.org/wiki/IPTSCRAE), with which I could write commands or eventually chatbots with the incantation
{ ... } CHATSTR "(.*)" GREPSTR IF
Since I was about 11 years old, my brain was plastic enough to take the postfix notation in stride, and regular expression syntax is still second nature to me.
What I do now is I write a comment like this in my IDE:
# a regex that selects <something>
regex =
And then copilot/supermaven auto-completes it. If that doesn't work, I ask GPT-4o/Sonnet. If that doesn't work, I assume that whatever I'm asking for is not really a natural fit for regex and I should accomplish my task in a different way.
In general I try not to use regex in production code. IMO it is an obsolete technology at this point. Most people do not know it well and trying to debug it is a nightmare. May I suggest a simple function or loop that is readable?
I only used the bare minimum for years.
I also hung out on a #regex IRC channel, so I got exposed to questions and answers by many people.
Later I read up on https://www.regular-expressions.info/ which has a lot of very good explanations.