Learn RegEx step by step, from zero to advanced (regexlearn.com)
364 points by aykutkardas 71 days ago | 108 comments

I learn regex just before I need it. Every. Time. After 25 years, it just doesn't stick.

I was excellent at regex early in my career... actually had a job where that's basically all I did for 9 months. Read the O'Reilly book Mastering Regular Expressions from cover to cover and referenced it multiple times per day. Doing regex at a high level was instinctual.

Lost much of that knowledge within a couple years after leaving that job... was shocked how much wasn't retained when I stumbled into another project that required a fair bit of regex work. There was some muscle memory involved and I was able to ramp up quickly, but now 20 years after that initial job I'm just like you.

In fairness, I don't think I was ever a heavy user quite like you describe but if I take another language that I knew well and now dip into every so often - like C - I'm going again in 5 minutes. I don't find that with regex. I do try ... I think it's cool but it just slides straight out.

I think the difference though is C is not unlike most procedural programming languages. The main thing I imagine is coming back up to speed w/ stdlib and having to get back into the rhythm of managing memory. Regex is unlike anything you typically do on a day-to-day basis. It's just an alien mental model and once you get past the basics feels like you've acquired some sort of superpower few actually need.

It's like being a native English speaker who also speaks a few other Latin-based languages vs. having that one Eastern language thrown into the mix.

Same. I just don't use it often enough for it to really stick. For many years my regex knowledge was basically stuck at how to use ".*" and things like [a-d] or [a-dA-D] blah, etc. Never used backreferences, capture groups, etc.

The thing that finally forced me to dig a little deeper was an assignment at work that involved Apache HTTPD and mod_proxy and the need to define some really complex routing rules that were imposed on us by something upstream of our service. We wound up having to peek into the incoming URL and route things differently based on sub-elements of the overall path. So I finally had to learn to use capture groups and get into the difference between the "greedy" and "non greedy" matching, yadda, etc. And the thing is, when I figured it out and got all that working, I felt like I'd acquired a new super-power.

For about 3 weeks. Now, I'm pretty sure all of the new stuff I learned has totally escaped my memory again, because - once again - I haven't had any call to touch a regex in almost 2 years.

sigh I should probably look for an Anki deck on regexes and start doing spaced repetition on them just to try to finally get this stuff locked in.
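For anyone in the same boat, the greedy versus non-greedy distinction mentioned above fits in a few lines of Python. This is only a sketch: the path and both patterns are made up for illustration and are not real mod_proxy rules.

```python
import re

# Hypothetical incoming path; the layout is invented for this example.
path = "/api/v2/customers/42/orders/7"

# Greedy: the first (.*) takes as much as it can, so the split lands on the LAST slash.
greedy = re.match(r"^/api/(.*)/(.*)$", path)

# Non-greedy: (.*?) takes as little as it can, so the split lands on the FIRST slash.
lazy = re.match(r"^/api/(.*?)/(.*)$", path)

print(greedy.groups())  # ('v2/customers/42/orders', '7')
print(lazy.groups())    # ('v2', 'customers/42/orders/7')
```

The two capture groups are what a routing rule would then reuse when building the target URL.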

I simply don't want to memorize 3 or more different versions of the syntax codes.

Sometimes I'm using .Net. Sometimes I'm using Python. Sometimes I'm using whatever oddball engine the developer chose.

I know how regex works. I can use forward and backward references. I can combine complex patterns. I can match, extract, replace, transform, etc. Sometimes I have even used nested patterns (though I'd probably need half an hour to re-learn it well enough to read one). But I'm not sitting down and memorizing the difference between \d and \D, or \S and \w or any of that. Frankly, I'm very lucky if I remember the difference between ^ and $. I will have the .Net[0] and the Python[1] doc in my bookmarks forever.

I'm not remotely ashamed of it, either. The codes are completely arbitrary with absolutely no intrinsic meaning. Worse, it's not easy to tell the difference at a glance between literal characters, character classes, operators, wildcards, special constructs, etc. More than once I've been confused by a regex only to discover it does something I didn't know they could even do. Regex patterns are meant to be concise and comprehensible to the regex engine, not to the programmer.

Don't feel bad because you don't memorize an arbitrary and complex syntax. Memorizing syntax is not the job of a programmer. The job of a programmer is to compose the logic and design the system and know that a syntax exists to compose it in. A programmer is an author, not a linguist.

[0]: https://docs.microsoft.com/en-us/dotnet/standard/base-types/...

[1]: https://docs.python.org/3/howto/regex.html

That's one heavy pain point in emacs for instance. Even though it seems trivial to map char classes and syntax switch it's so utterly tiring. To the point where my first reaction is "cli unix tools share most of the syntax, it's so fluid... runs grep"

I agree. Regex for me is particularly slippery - I haven't needed to code 68000 assembler since 2002 but I had a play with an emulator last month and slipped back into it. There's something about the syntax that my brain just can't hold onto.

Same here. Because I never need "regex" as a skill by itself in a project, but just needed to search/replace one damn string for one particular situation, and next year another situation in a completely unrelated place for completely unrelated reasons so whatever I learned (and forgotten) last year wouldn't really apply anyway.

I'm the same way. I suppose some people use regex's a lot, I never really have.

If our experiences are typical, I'd argue people new to regex's should mostly learn that they exist and when they can be useful, and not worry too much about actually learning their mechanics.

This was my approach and I found it pretty successful the first and most recent time I used regex.

I think that's okay. It's the same for me. The important part is understanding the concept of regex enough to know when you need it.

I have some basics down pat (., +, *, ?, ^, $, [], [^], (), \d) and anything more I always have to look up (and much of it differs between engines anyway). Usually, though, what takes more time to figure out than the regex itself is whatever unholy combination of escaping rules is in force in the place where I have to use it.

I'm with you - if I'm doing a quick find/replace in VSCode then I can usually muddle a way through but anything more complex leaves me somewhat at a loss.

That course looks great and I'm sure that at the end of it I will be absolutely rocking at regex. Then I won't use it for 4 months and need it for something and all but the basics will be gone. I'll essentially have to relearn it each time. I don't think it's the learning material that's the problem, I've even done interactive Jupyter training, there's something about Regex itself that just won't stick.

I believe part of the problem is the slight syntax variations in different languages/environments. I've been writing regexes in Unix utilities (vim, grep/sed) for the most part and when I found an answer on SO about a specific problem I was trying to solve, it was in JS. I barely understood what was being written, to the point I stopped bothering "transcribing" it to UNIX syntax.

Why use it though? I'd rather write or read 30 lines of code instead of trying to decipher what a regex means or does and making sure it is covering all possible cases.

I don't understand the goodwill toward regexes. It's basically an embedded BrainFuck in your programming language.

I'm 14 years into my dev career, and there was never a moment where not using regexes became a problem.

> Why use it though?

Because a well written regex performs extremely well (regex engines are often very highly optimized).

It gives you all the benefits of using a domain-specific language and using an extremely mature software library. Just like a domain-specific language, it will have a baked-in philosophy involving the exact task you want to accomplish, so it will not suffer from language vs algorithm impedance. Just like using a mature library, it will probably have accounted for weird oddball cases that you're not even thinking of and have enough features to do everything you will want.

> It's basically an embedded BrainFuck in your programming language.

I don't disagree. It's not easy to read and can be hard to maintain. There are ways to write regex such that it's easier to understand, but the syntax generally doesn't make it easy to do that and doesn't encourage you to spend the time on it.

However, when you see a regex, you do know that it's 100% used to manipulate strings. That alone tells you quite a bit about what is going on.

I think it's case-by-case. I once scraped a list of names and dates from individual Wikipedia pages. There were lots of formats like "1900-1950", "1900 - 1950", "(1900 - 1950)", "(1900 to 1950)", "1900 to cf. 1950", and so on. These were arbitrarily nested in the first couple sentences.

My thought was "Oh I think this is a job for that regex thing" and 35 minutes of googling syntax + a handful of passes later I had all the dates in a workable table. I have no idea how much code that would have taken. Albeit, I am a novice programmer.
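To give a feel for how little code that is, here is a rough Python sketch covering just the formats listed above. The pattern is a guess at something that would work, not the one actually used, and real Wikipedia text would almost certainly need more cases.

```python
import re

# Two four-digit years joined by a hyphen, an en dash, or "to",
# with an optional "cf." before the second year.
YEAR_RANGE = re.compile(r"(\d{4})\s*(?:[-–]|to)\s*(?:cf\.\s*)?(\d{4})")

samples = [
    "1900-1950",
    "1900 - 1950",
    "(1900 - 1950)",
    "(1900 to 1950)",
    "1900 to cf. 1950",
]

# search() finds the range anywhere in the string, so the parentheses need no special handling.
ranges = [YEAR_RANGE.search(s).groups() for s in samples]
print(ranges)
```

Every sample should come back as the same ('1900', '1950') pair, which is what makes the result easy to drop into a table.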

  [0] Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
I love regular expressions and figuring out the syntax to solve problems I encounter where I can use regex. Such as searching code for calls with specific named params that may be in different order across the code base. But I always have this quote in the back of my mind if I'm thinking about using regex in code, especially areas that get hit a lot.

[0] https://blog.codinghorror.com/regular-expressions-now-you-ha...

It's really subject-specific. What I usually see is a small problem easily solved by Regex, that a few months later down the line we need to expand the regex a little bit, but that's ok it's still readable... repeat this a bunch of times and now you have a critical regex line that nobody really understands how it works and nobody ever wants to touch without it being unit-tested to death.

It's just a riff on tech-debt, imho

We need a natural-language <-> regex translator

I am a huge, huge fan of https://regexper.com/ - paste pretty much any regex, and it will generate a visualization of what it does. It's been invaluable even just as a sanity check, it just makes it so easy to trace the flow of what a pattern is doing.

Back when I used Windows, I had an app called Regex Buddy and you could paste in any regex and it would generate a comment block that broke down what was going on. Was super useful when you come back a few months later and need to understand a complex regex

I once asked my friend: "Is there anything Regex can't do?" He replied: "Work the first time?"

And that's how I always think of Regex now. It's an incredibly powerful tool that I pull out pretty often, but if I'm doing anything beyond the most basic pattern matching I know I'm about to lose a bit of time questioning my abilities and sanity.

I respect regex for being a widespread DSL that makes certain things very declarative.

My problem with it (other than all the different flavors—that’s just history’s fault) is that you don’t tend to get integrated help/tool support for them. They are often just plain strings—how is e.g. Intellij supposed to spot them reliably and help you?

It’s too bad if we have to copy and paste regex strings into websites in order to figure out what they do.

The number one most useful tool support in my book would just be to highlight metacharacters. Are parens a metacharacter in the dialect that I am using now? Is `+` significant?

A bigger ask would be more verbose regular expression declarations and compilers that can translate to and from them. It’s nice if you can use comments in a regex dialect, but comments should be treated like they are treated in code writ large; don’t use them if you can make the code “self-describing”. Imagine a more verbose declaration language where you can make local aliases, for example `let ident = [a-z][a-z1-9_]*`.
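Something close to the alias idea is already possible in languages with string interpolation. A minimal Python sketch, assuming plain variables stand in for the `let` aliases; `assignment` and `ASSIGN` are hypothetical names invented for this example:

```python
import re

# A local "alias": an ordinary string holding a regex fragment (copied from the comment above).
ident = r"[a-z][a-z1-9_]*"

# Larger patterns are composed from the named fragments.
assignment = rf"(?P<target>{ident})\s*=\s*(?P<source>{ident})"
ASSIGN = re.compile(rf"^{assignment}$")

m = ASSIGN.match("total_1 = subtotal")
print(m.group("target"), m.group("source"))  # total_1 subtotal

# An identifier may not start with a digit, so this does not match.
print(ASSIGN.match("9lives = x"))  # None
```

This is plain string concatenation, so it gets none of the tooling support asked for above, but it does keep each fragment readable and named.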

I agree that it would be nice to get native syntax highlighting for strings, especially for languages where regular expressions are not a first class feature

However, I would realllyyy not want to have a more verbose regex dialect. For me at least, it would make recognizing common regex patterns a lot harder since they won't be succinct. It would also be yet _another_ regex variety that I would have to remember.

For me, Perl's DEFINE feature and 'x' switch are definitely close enough[1] to variables and comments. Pardon, I don't remember what they are officially called. What we really need is for more non PCRE engines to implement these features

[1]: https://regex101.com/r/z5pGvZ/6

> They are often just plain strings—how is e.g. Intellij supposed to spot them reliably and help you?

WebStorm or any IDE that has WebStorm functionality built-in can detect `/pattern/` in Javascript and give you an option to test it and warn you of syntax errors. PyCharm will help you with functions in `re` module, Rider with `Regex.IsMatch` and so on, all JetBrains IDEs have some sort of contextual help that gives you guidance on stdlib functions that accept regular expressions.


Julia has a "readable regex" library that uses descriptive English names. It's a lot like a SQL query builder, but it builds regex.



> how is e.g. Intellij supposed to spot them reliably and help you

It depends on the language. JavaScript has a dedicated RegExp object. If you use the literal notation or constructor, you will always get IntelliJ support (unless you decide to use the constructor and extract the regex string for whatever reason). I _think_ you could also mark a standalone string variable with JSDoc to get IDE support, too.

I have learned regex, I learned it from regexone years ago so appreciate this type of teaching method. What I will say though is it's a little pedantic if you won't let me move on despite my solution being correct. An example is this:


When the answer required was (if I remember correctly):


I think as long as you're correct you should be allowed to move on. By all means show your preferred solution.

What's also a bit of a pain is when you click Show Solution you still have to type it in the box to continue.

Apart from that I love it! I normally use regexr.com and just mess around with it until I get the desired result. It also helps with learning, but you end up never truly understanding the concepts.

For most purposes, sites like that or https://regex101.com/ are fine. But, small differences in regexes (greediness, backtracking) can make huge performance differences that you won't notice when testing it on 2 or 3 lines on a site like that.

A nice example was a big Cloudflare outage in 2019: https://blog.cloudflare.com/details-of-the-cloudflare-outage...

So for anyone using regex in (1) production as (2) part of an automation / regular process (i.e. not a one time search) on (3) sizeable amounts or reoccurrences of data, I'd really advise gaining a deeper understanding of what the various options do.
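A small way to see the effect for yourself, sketched in Python. The nested-quantifier pattern below is a textbook backtracking example chosen for illustration, not the pattern from the Cloudflare incident, and the input is kept short on purpose so the slow case still finishes quickly.

```python
import re
import time

# A run of "a" followed by one character that makes the match fail.
text = "a" * 18 + "b"

nested = re.compile(r"(a+)+$")  # nested quantifiers: many ways to split the run of a's
flat = re.compile(r"a+$")       # accepts the same strings, with far less backtracking

def timed(pattern, s):
    """Return the search result and how long the search took, in seconds."""
    start = time.perf_counter()
    result = pattern.search(s)
    return result, time.perf_counter() - start

nested_result, nested_secs = timed(nested, text)
flat_result, flat_secs = timed(flat, text)

# Both searches fail, but the nested version does far more work to find that out.
print(nested_result, f"{nested_secs:.4f}s")
print(flat_result, f"{flat_secs:.6f}s")
```

With a backtracking engine such as Python's `re`, each extra "a" in the input roughly doubles the time for the nested pattern, which is why a test on two or three short lines will not reveal the problem.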

The English copy of some of the learning examples is rough and complex, which takes away from the lessons.

An example from 21/55:

"To express at least a certain number of occurrences of a character, we write the end of the character at least how many times we want it to occur, with a comma , at the end, and inside curly braces {n, }. For example, indicate that the following letter e can occur at least 3 time."

It should read something like:

"To express a match of a minimum number of character occurrences, we use a comma after the number of occurrences, within the curly braces {n, }. In the following, try building a regex to match the letter e occurring a minimum of 3 times sequentially."

>Regular Expressions, abbreviated as RegEx or RegExp, are a string of characters created within the framework of RegEx syntax rules. You can easily manage your data with RegEx, which uses commands like finding, matching, and editing. Regex can be used in programming languages such as Phyton, SQL, Javascript, R, Google Analytics, Google Data Studio, and throughout the coding process. Learn regex online with examples and tutorials on RegexLearn now.

I always look at these intros/descriptions of Regex with a heavy heart. They describe what regex's are, but none of the info is going to make much sense to someone who doesn't already know why they would want to learn them.

The best motivation for regexes that I've read is actually from a Python Tutorial [0] where the author gives an example of writing a lot of nested 'if' statements that could all be solved by a single regex. On the whole, I think regexes are one of the most powerful tools that doesn't have enough publicity in large part due to this Catch 22 of trying to explain what they are.

[0] https://automatetheboringstuff.com/chapter7/

> They describe what regex's are, but none of the info is going to make much sense to someone who doesn't already know why they would want to learn them.

I frequently use the journalist's "5 Ws and H" framework as a checklist procedure for ensuring my technical communication covers fundamental questions/ideas:

* Who

* What

* Where

* Why

* When

* How

The slightly tricky thing is that you have to formulate a question for each W based on your domain. For example what is a fruitful "where" question for RegEx? Nonetheless the checklist makes me less likely to miss very key ideas, such as "why" one would use RegEx.

To make this idea more procedural maybe we could just formulate it as ungrammatical questions where you put the key topic after each W:

* Who RegEx?

* What RegEx?

* Where RegEx?

* Why RegEx?

* When RegEx?

* How RegEx?

And then just let your mind flesh them out into more complete questions...

That's a helpful list / framework. Thanks.

That said, where I see most tech sites / products fail is on addressing benefits. Why should I care? (As opposed to the brand or product why.)

I wish I had $20 for every "Looks cool. But it's not clear to me my life will be any better."

I think regex is absolutely terrible garbage. It’s very powerful, I do appreciate not having to write complex conditional statements, and it’s great that it’s available in so many languages and applications. But it’s just a bad tool, terribly unreadable, easy to introduce bugs, with lots of trial and error. It has too much brevity and would be much better if it were longer but more human readable. I’m sure there are other string search paradigms that are far better but relatively unknown.

You can get a feel for what that would look like here: https://metacpan.org/release/CHROMATIC/Regexp-English-1.01/v...

But then you're just memorizing things like 'start_of_line' instead of '^'. Perhaps easier to read, but no easier to write.

        -> start_of_line
        -> literal('Flippers')
        -> literal(':')
        -> optional
                -> whitespace_char
        -> end
        -> remember
                -> multiple
                        -> digit;
I literally can’t parse this as a whole. /^Flippers:\s?(\d+)/ is so much more obvious compared to that utter nonsense.

Like most code, it's easier to write regex than to read it later. In my recent vim history:

This was from two days ago. I think I was searching a huge sheet of regex match groups for any having line breaks to join. In a month, I'm not even sure I would recognize that I had authored this.

So what. That was a problem you had to solve, imagine how helpless you’d feel if you had it with no regex available. Matching non-parenthesis or newline for two lines (prefix and suffix unrestricted) it is. Idk if it took half an hour or more to implement that in python, js or (god forbid) a low level language. You probably made it in less than a minute. And nobody would take their time to read a page of .substr(i, -(j-i)-1) two days later either.

not every solution has to be reusable

Your long-hand isn't quite the same as your regex...it should be remember -> one_or_more -> digit;

In regex parlance, \d+ explicitly allows for one or more digits. Multiple tacitly implies 2 or more which would be \d{2,}

Also, your end char (which I assume you mean $) would be after the remember -> one_or_more -> digit;
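The one-liner and the quantifier point are easy to check in Python. A quick sketch, with input strings made up for the example:

```python
import re

# The pattern quoted earlier in the thread, written as a Python raw string.
FLIPPERS = re.compile(r"^Flippers:\s?(\d+)")

m = FLIPPERS.match("Flippers: 7")
print(m.group(1))  # 7

# \d+ means one or more digits, so a single digit is enough.
print(re.fullmatch(r"\d+", "7") is not None)      # True

# \d{2,} means two or more digits, so a single digit is rejected.
print(re.fullmatch(r"\d{2,}", "7") is not None)   # False
print(re.fullmatch(r"\d{2,}", "42") is not None)  # True
```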

I didn’t refer to the manual (which is the entire goal of that format, isn’t it?) and don’t know what ‘multiple’ really means. So I stand both corrected and confirmed, I guess.

That ‘end’ thing just closes the ‘optional’ group, I believe. There is no $ in an English form of this regex either.

Readability is very important though. If you can spend a couple more seconds of programming time to prevent several minutes (or longer!) of understanding time, I'd call that a good use of resources. I don't think that link is quite there yet but it's a good start.

It's more readable individually, but for many regexes the verbose nature could make it harder to read overall.

There's a good article about K that can give you a feel on how long names may not always be more readable: http://nsl.com/papers/denial.html

The readability isn't so bad if you let yourself allocate as much time and mental effort to understanding the one-line regex as you would use to understand the 100 line string-processing function that it replaces. And the brevity makes regexes handy on the command line and in single-line input fields in text editor search functions.

I do prefer using parser combinators for more complex tasks.

> I’m sure there are other string search paradigms that are far better but relatively unknown

Surely if there were, we’d have discovered them already. All of the regex criticism boils down to a few simple statements for categories of cases:

1) I didn’t learn regex and have no cheatsheet

Learn it or at least print a cheatsheet and stick it to the wall.

2) The problem that this specific regex solves is a hell of a regular problem under any representation.

Any particular regex is only as terrible as a ladder of corresponding if’s and for’s would be. Deal with it.

3) The problem that this specific regex solves is not a regular language.

Use a proper xml parser.

You seem to forget that regular expressions are pretty much simply required - and at least for their simpler cases, their syntax is reasonable - 'syn[a-z ]+?able' is far from unreadable and unwritable.

You have some text to process, open your text editor, you will probably use a dozen regular expressions for that - this is very frequent for many. Can you conceive a better syntax, at least for the simple cases?

Ignoring the flame bait, for me the only things I wish more regex engines supported (cough JavaScript) is the ability to ignore whitespace, and have named groups. Python has a flag to do this, and being able to have multi-line regexes with comments and named groups is phenomenal and greatly improves readability of more complex regexes.

In general I would say ~70% of regexes are highly readable. With tools like the above, you can probably go to like ~85%? There are some regexes that are super complicated and then likely should be refactored into a composition of simpler regexes. But that's just a guess. I wonder if there are any studies done about this...

> Ignoring the flame bait, for me the only things I wish more regex engines supported (cough JavaScript) is the ability to ignore whitespace, and have named groups.

Irregex? http://synthcode.com/scheme/irregex/

Interesting! I don't think that's what I mean. I don't think I want it to be a part of the language like that, but that's a pretty neat idea. For example, python has an 'X' flag you can use when creating a regex to allow new lines and comments. Here's an example from my code: https://github.com/internetarchive/openlibrary/blob/1ac15a48...
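For readers who have not seen that flag, a minimal sketch of the idea (not taken from the linked code; the date pattern and the sample sentence are made up for illustration):

```python
import re

# re.VERBOSE (alias re.X) ignores whitespace in the pattern and allows # comments.
ISO_DATE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)

m = ISO_DATE.search("released on 2021-09-14, patched later")
print(m.groupdict())  # {'year': '2021', 'month': '09', 'day': '14'}
```

Named groups mean the calling code reads `m.group("year")` rather than counting parentheses.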

I’d argue that regex is elegant and incredibly useful to have in development… but it’s definitely easy to have ‘too much of a good thing’ here.

Regex is a great tool. Just use good taste and don't overdo it with regexes.

They're very effective at what they do as long as you don't make insane brainteasers that make people curse your name.

This is why I like tools such as RegExBuddy, which breaks down the regex into a graphic. It does real-time match highlights of test text and emulates most Regex engines.

It’s just a small programming language.

> I always look at these intros/descriptions of Regex with a heavy heart. They describe what regex's are, but none of the info is going to make much sense to someone who doesn't already know why they would want to learn them.

> Catch 22 of trying to explain what they are.

any teachers, or people who explain/document things for a living, have some good tips or templates to avoid this?

I don't see what's so hard.

"With regex you can search for any combination of characters in a string or return any such combo or modification you like"


1) Encourage whoever you're teaching to stop you immediately if they don't feel like they understand something you're saying, even if it's a single word that's throwing them off, and especially if they're not rock-solid about a simple concept they "should already know". Modern school teaches people that "returning to the basics" is a waste of time; but as Feynman says, you should return to the basics often, as masters do. Pianists don't stop playing scales once they're famous. This means that if your student wants to review what an "expression" is, or what a "string" is, or what "returning" means, you've got to encourage them to do it. If a 10-minute explanation of RegEx turns into a 45-minute review of how the string variable type was invented, that will be more useful for the student in their pursuit of RegEx mastery than will a technically accurate but shallow regurgitation of your 10-minute spiel about what RegEx is. This is because they need to lay the mental framework of how they're going to think about RegEx; you are able to explain it in 10 minutes because you already have that built in your head, but they need to build those background pathways and connections themselves before analogies and summarizations make sense.

2) Try to figure out how you can make them experience the problem that led to the invention of RegEx. A student will never truly understand why a solution is valuable until they really, deeply understand the problem that the solution is solving. Note that I'm not saying that you need to teach the problem before the solution--not every student needs them in that order--just that they won't master the solution until they understand the problem.

3) In lieu of "testing" a student, have them take many breaks to re-explain what they've learned to you, even if you haven't reached a real conclusion about anything and are just checking that they understand a sentence you said. Many students, especially if they have a good teacher, will experience the sensation of comprehension even if it's not actually there. This is the "it makes sense when he says it, but when I try to explain it I can't find the words" phenomenon. Taking frequent breaks to have them explain things back to you in their own words will reveal their conceptual weaknesses, and those are what you focus on.

4) Don't try to get it all done in a single session. Learning requires both forgetting and sleep. First, you should tell them to expect to forget, and that they will need to come back over and over again to topics that seem basic or simple; forgetting is part of the process of learning, like painting multiple layers on a wall. Second, they need to sleep in between sessions, which means that you can't teach everything in one day and you can't learn everything in one day, and multiple days may need to be spent reviewing the same material.

This all makes a lot more sense when you treat learning like sports. Learning <programming topic> is like learning a slice serve in tennis. You don't need to serve slice, especially if you can hit flat serves at 115 mph, but serving slice is an invaluable technique when you're playing someone who can't return slice serves at all--that's a near-guaranteed 3/6 games out of every set. But in order to learn it, you need to focus on your tennis fundamentals (stay loose, eye on the ball, toss correctly), practice the same basic movements over and over again, get lots of sleep, and understand why you're learning the skill in the first place.

Good answer. I think 2) is the one that jumped out at me because it reflected my own experience and understanding - Regex became easier to understand when I also felt like I understood its motivations. Starting there, with motivation and context, is my typical go-to move.

Very valuable insight, thank you a lot!

Doesn't make sense to me.

> Regular expressions (commonly known as "regex") are used for advanced pattern matching in strings. They can also be used to replace text, transform strings, or extract substrings. It's a very powerful domain-specific language that is purpose-built for string patterns and manipulation. Many general-purpose programming languages include regex engines that use similar, but often slightly different syntaxes to support the use of regex.

>teachers, or people who explain/document things for a living

I'm neither of those, but I frequently explain things to my friends and they say I explain well. So I will throw my two cents anyway and hope you don't find them trivial self-help platitudes.

(1) Start with Concrete things

No learning ever starts from generalities. Never start with something like "Regular Expressions is a declarative language to describe strings of a certain general form blah blah blah". I call this the Wikipedia style of teaching, an utterly useless word-swapping game where you explain things and constructs in terms of even more complicated (or equivalently complicated) things and constructs till the learner runs out of stack space and comes out learning nothing and feeling like a failure on top of that. Remember that learning is a process of building up: you start from familiar questions, problems, specifics, themes or worldviews of the learner, then gradually introduce generalizations and solutions to get them to where you want them to be.

(This is generally a two-way street, the learner also has to know something about the teacher and where they are coming from and what are they trying to do, it's like telling a story: The author can't simply say "because I say so!" to explain every detail of the plot, but the reader can't also say "I don't know, feels too unbelievable" in response to every plot detail.)

The bare essence of regex is using meta characters to encode several string characters. The fact that the regex


so powerfully and succinctly encodes string-recognizing logic that would be imperatively expressed as

    def metastar(s):
        # Too short to start with "meta" at all.
        if len(s) < 4:
            return False
        # The first four characters must be exactly "meta".
        if s[0:4] != "meta":
            return False
        return True

Makes the case concretely and perfectly: a single string (two letters longer than the simplest string it matches) versus 3 bug-hiding branches (e.g. what if the "!=" operator in the implementation language actually compares string-identity, not string-equality?). This is even more generous than most languages allow, the ':' array slicing operator for example is saving us a loop. (possibly inefficiently, if it's copying the slice from the string. Not a problem now for "meta", but who knows when it will be?)
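To make the comparison runnable, here is a small Python sketch. The pattern itself did not survive in the comment above, so `meta.*` is an assumption based on the "two letters longer" remark and the function name; `metastar_regex` and `metastar_imperative` are hypothetical names for the two versions.

```python
import re

# Assumed pattern: the literal "meta" followed by anything at all.
META_STAR = re.compile(r"meta.*")

def metastar_regex(s):
    # match() anchors at the start of the string.
    return META_STAR.match(s) is not None

def metastar_imperative(s):
    # The same check spelled out by hand.
    return len(s) >= 4 and s[0:4] == "meta"

samples = ["meta", "metadata", "met", "", "xmeta"]
results = [(s, metastar_regex(s), metastar_imperative(s)) for s in samples]
for row in results:
    print(row)
```

Both versions should agree on every sample; the regex just says it in six characters.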

Regexes are patterns, which are things that resemble the things they are describing, but aren't any of those things specifically. It's like a dark silhouette of a man: it doesn't describe any specific man, it's a pattern that can match any man of the same general body plan and height. Regexes are silhouettes, and the dark parts are the meta characters that act as placeholders for arbitrary strings.

(2) Examples from real life

Don't just take the "menu approach" of reading all the features and meta characters and thinking you're explaining, actually take the time with examples. Again, examples are all that matters for the human brain, it's literally useless to tell somebody to imagine a golden mountain if they have never seen a mountain or gold before.

Our world is awash in strings of certain identifiable structures (money, dates, times, names in formal settings, equations, etc.). Take the time to obtain several real-life examples, ideally with data drawn from sources like Wikipedia or other publicly available datasets. After demonstrating how 3 or 4 of those general forms of strings can be described powerfully by a given meta character, give the learner 3 or 4 more general forms to try on their own.

(3) Visualize executions, introducing debugging tools in the process

Just because regexes are declarative, doesn't mean the matching process can't be described in imperative terms, especially initially.

Later on, introduce tools like https://regex101.com/ or https://www.debuggex.com/ and always draw "railroad diagrams" that show what a given regex matches in terms of easily verbalized diagrams.
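To make the "imperative terms" idea concrete, here is a minimal sketch of a toy backtracking matcher in Python that records every step it takes. It is an illustration only: it supports just literal characters, `.` and `*`, it anchors at the start of the text, and it is not how any production regex engine is implemented. The names `match_here`, `match_star` and `trace` are my own.

```python
def match_here(pattern: str, text: str, trace: list, depth: int = 0) -> bool:
    """Try to match pattern at the very start of text, logging each step."""
    trace.append(f"{'  ' * depth}pattern={pattern!r} text={text!r}")
    if not pattern:
        return True  # nothing left to match: success
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text, trace, depth)
    if text and pattern[0] in (".", text[0]):
        # One character consumed on both sides, continue with the rest.
        return match_here(pattern[1:], text[1:], trace, depth + 1)
    return False

def match_star(c: str, pattern: str, text: str, trace: list, depth: int) -> bool:
    """Match c* followed by pattern, trying the fewest repetitions first."""
    i = 0
    while True:
        if match_here(pattern, text[i:], trace, depth + 1):
            return True
        if i < len(text) and c in (".", text[i]):
            i += 1  # let the star swallow one more character and retry
        else:
            return False

trace: list = []
matched = match_here("meta.*", "meta-circular", trace)
```

Printing `"\n".join(trace)` afterwards shows the pattern and the text shrinking together one character at a time, which is roughly the story a regex debugger tells with nicer graphics.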

(4) Disadvantages, subtleties, and other approaches

The learning process isn't a sales pitch; there are plenty of things that suck in regexes. They are non-standard and designed ad hoc. The runtime engine that runs them can be inefficient (unlikely if the host programming language is popular and more than 20 years old, but a thing to keep in mind nonetheless: regexes are a whole other language, requiring a separate interpreter or compiler from the one for the surrounding code). And the equivalent imperative code might not be so bad in comparison for simple cases, and much more debuggable.

The name "regex" is derived from a misnomer. The original "regular expressions" are a mathematical formalism to encode finite-state machines, and it originally contained only alternation, sequencing and the Kleene star (the '|' and '*' operators, plus putting letters next to each other; that's it, those were the original regex capabilities). When programming languages and command-line utilities started to implement them in the 70s and 80s, each started to experiment with features that break this model. For example, "capture groups", the ability of the regex to copy parts of the matched string into variables, trivially break the model: if you can capture arbitrarily long strings, then you can't be a finite state machine.
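A small Python sketch of the contrast, under my own assumptions: `accepts_meta_star` is a hand-written finite-state machine for `meta.*` (five states, no other memory), while the second pattern uses a backreference to a capture group, which is the usual way captures end up needing unbounded memory. Both names are invented for this example.

```python
import re

def accepts_meta_star(s: str) -> bool:
    """Finite-state machine for meta.* anchored at the start of the string."""
    prefix = "meta"
    state = 0  # number of prefix characters matched so far (0..4)
    for ch in s:
        if state == len(prefix):
            break  # accepting state: any further input keeps us here
        if ch != prefix[state]:
            return False  # dead state, no way back
        state += 1
    return state == len(prefix)

# (a+)b\1 means: some a's, a 'b', then exactly the same run of a's again.
# The engine has to remember a capture of arbitrary length to check this.
SAME_RUN_TWICE = re.compile(r"(a+)b\1")

fsm_ok = accepts_meta_star("meta-circular")
balanced = SAME_RUN_TWICE.fullmatch("aaabaaa") is not None
unbalanced = SAME_RUN_TWICE.fullmatch("aaabaa") is not None
```

The first function never needs more than a single small integer of state no matter how long the input is; the second pattern cannot be checked that way, because the number of a's to remember is unbounded.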

This increases power but decreases efficiency guarantees (Perl's regexes are dangerously close to Turing-completeness [https://www.perlmonks.org/?node_id=809842]! The language is hiding a whole other language inside a single feature). It also complicates the notation with symbols for new capabilities that it wasn't designed for, with the result being the mess that regex syntax is now. It also means you can never "learn" regex; you can only learn (to whatever accuracy you care about) Perl's regex, or Java's regex, or Python's regex. There is a vague set of commonalities, but don't rely on remembering which features are common and which are different when there are so many features implemented in so many ways.

Don't let the learner come away thinking that "declarative" is synonymous with regexes. For example there is the parser combinator style, which can encode the above example as something like:

the_specific_string("meta").followed_by(ANY_LETTER).repeated(ZERO_OR_MORE_TIMES).build_pattern().recognize("meta-circular")

The key idea at play here is a sort of "builder pattern". There is an abstract "parser" object that has a single recognize(str) method, and you build your pattern by composing together the many customizable children that implement this abstract interface. The composition happens through "combinator methods", which take two or more parsers and build a parser that performs a mixture of their functionalities, as indicated by the name (e.g. followed_by() takes several parsers and sequences them one after another; repeated() takes a list of parsers and iterates the last one any number of times, including skipping it entirely). The things being built to represent parsers are generally (in functional languages at least) closures, but there is no reason why this pattern can't be built on top of regexes: each step simply generates the equivalent meta character, and build_pattern returns the final pattern string.
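A minimal Python sketch of the "built on top of regexes" variant, with loud caveats: this is not any real library's API, every name below (`PatternBuilder`, `Recognizer`, `ANY_LETTER`, and so on) is hypothetical and merely mirrors the chain above, and I am reading ANY_LETTER as "any character" so that it lines up with `.`.

```python
import re

ANY_LETTER = "."            # assumption: "any letter" means any character
ZERO_OR_MORE_TIMES = "*"

class Recognizer:
    """Wraps the final pattern string and exposes recognize(text)."""
    def __init__(self, pattern: str):
        self.pattern = pattern
        self._compiled = re.compile(pattern)

    def recognize(self, text: str) -> bool:
        return self._compiled.fullmatch(text) is not None

class PatternBuilder:
    """Each step appends or rewrites a regex fragment; nothing is matched yet."""
    def __init__(self, parts: tuple = ()):
        self._parts = parts

    def followed_by(self, fragment: str) -> "PatternBuilder":
        return PatternBuilder(self._parts + (fragment,))

    def repeated(self, quantifier: str) -> "PatternBuilder":
        if not self._parts:
            raise ValueError("nothing to repeat")
        *head, last = self._parts
        # Wrap the last fragment in a non-capturing group before quantifying.
        return PatternBuilder((*head, f"(?:{last}){quantifier}"))

    def build_pattern(self) -> Recognizer:
        return Recognizer("".join(self._parts))

def the_specific_string(text: str) -> PatternBuilder:
    # re.escape keeps literal text from being read as meta characters.
    return PatternBuilder((re.escape(text),))

recognizer = (
    the_specific_string("meta")
    .followed_by(ANY_LETTER)
    .repeated(ZERO_OR_MORE_TIMES)
    .build_pattern()
)
```

Here `recognizer.pattern` is the generated regex string and `recognizer.recognize("meta-circular")` runs it, so the readable chain and the terse regex stay two views of the same thing.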

There are tons of those "parser approaches": formalisms, tools, patterns and libraries to express strings, string recognition and parsing declaratively. Regexes are merely the most famous and widespread, which is a sad state of affairs IMO.

Regex is powerful but I've found like 90% of the time I encounter one it would have been far simpler and more readable to use find + substring indexing or string splitting.

Imo replacing several nested if statements with a single esoteric regex is not necessarily a win. It depends on if pattern matching is really the best tool for the job.
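For a concrete (and admittedly hand-picked) case, here is a small Python sketch that splits a "key=value" line on the first "=" both ways. The input format and the function names are assumptions made up for this example.

```python
import re
from typing import Optional, Tuple

# Everything before the first '=' is the key, everything after is the value.
KEY_VALUE = re.compile(r"([^=]*)=(.*)", re.DOTALL)

def split_with_partition(line: str) -> Optional[Tuple[str, str]]:
    # str.partition splits on the first occurrence of the separator.
    key, sep, value = line.partition("=")
    return (key, value) if sep else None

def split_with_regex(line: str) -> Optional[Tuple[str, str]]:
    m = KEY_VALUE.fullmatch(line)
    return (m.group(1), m.group(2)) if m else None

lines = ["user=alice", "a=b=c", "novalue", "=empty-key"]
same_results = all(split_with_partition(l) == split_with_regex(l) for l in lines)
```

Both give the same answers on these inputs; which one reads better is exactly the judgment call being argued about here.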

> find + substring indexing

An endless source of off-by-one errors, not to mention buffer overflows, index out of bounds exceptions, accidental negative indexing.

How are you both getting buffer overflows and bounds check failures in the same code?

Anyway, these are problems that arise if you don't test the code. In that scenario, regexps are an endless source of unexpected behavior as well, including in some implementations stack overflows and ReDoS attack-surfaces.

It very much depends what the use case is. I find that a lot of the text processing I do is easier with back references or other regexy things.

Having said this, I use tools that make regexes easy to use and readily available - I think in many programming languages the syntax means that other solutions are just as easy to devise and implement.

If you are a solo dev, you do you, but if you are working in a team and you are building huge regexes with back references and other bells and whistles... I would guess it's not very readable for your teammates. At least for me, when I look at such a regex I have to stare at it for minutes before grokking it.

I wonder if there's a metric for code reviews measuring mean-time-to-grok (MTTG).

Regexps are fairly terse and replace a lot of code, compared to most languages they probably have an information density at somewhere between 10x-100x higher (i.e. it's not rare to replace 100 lines of code with 1 regex), so I think it's fair to expect it take longer to unpack their meaning.

Would that be a reasonable time to ping that coworker and ask them "what does this do?" Not because you can't figure it out, but because they already know.

Also interesting is the fact that parsers and languages are never mentioned. Regular expression engines are parsers for regular languages, one step below context-free languages.

I wonder why there are no context-free language parsers in standard libraries. The Earley parser can take grammars as input without necessarily having to generate code; it would be a great algorithm for a standard context-free parser.

Site looks cool. Good to see you've made it interactive.

Seems like you are supporting JS flavor, that should be mentioned prominently.

Suggestions for the cheatsheet:

1) Multiline example regex is missing the anchors

2) Negative lookbehind should be `(?<!)` not `(?!)`

3) `+` and `*` examples are using `()` around a character, which isn't needed (the `?` example doesn't use it)
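To illustrate suggestion 2, a short sketch of the difference between the two assertions. It is written in Python rather than JS, on the assumption that the syntax for these two constructs is the same in both flavors; the sample text is made up.

```python
import re

text = "a1 b2 c3"

# Negative lookbehind (?<!a): the digit must NOT be preceded by 'a'.
not_after_a = re.findall(r"(?<!a)\d", text)

# Negative lookahead (?!a) in the same spot asserts that the NEXT character
# is not 'a'. The next character is the digit itself, so it never filters.
lookahead_by_mistake = re.findall(r"(?!a)\d", text)

# Negative lookahead used as intended: a letter NOT followed by '1'.
not_before_1 = re.findall(r"[a-z](?!1)", text)
```

With this input, `not_after_a` is `['2', '3']`, `lookahead_by_mistake` is `['1', '2', '3']`, and `not_before_1` is `['b', 'c']`, which is why swapping `(?<!)` for `(?!)` on a cheatsheet quietly changes the meaning.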

I've had RegexBuddy installed on my computer for so long. It's been part of my basic dev setup for years: https://www.regexbuddy.com/

I used to use it a lot, although I've not used it in quite a while. I don't recall the last time I even needed to write a regex. Somehow it's still stuck in my head.

Well worth having it if you want a tool to hack around large blocks of text and play with regex in a "live" environment.

$40 is a bit steep, but I'd probably be willing to pay for a Linux version.

I find it hard to appreciate the need for this website. First, after scrolling down to the bottom of the page, I still couldn't see where I could "start". I clicked the only prominent button featured on the page, "Product Hunt", and that took me to a clearly unrelated website. You'd think the green words "learn, practice, test, share" would be links, but nope. So after 30+ seconds of browsing it still isn't clear to me where the product was. I gave up.

Compare that with https://regexone.com/ . No fluff. I immediately know what I'm looking at and am given a simple practice problem, with a place to type smack in the middle of the page, and a responsive green coloring when I get something right. The problems advance to exactly the kind of stuff I need, with a quick view on all the lessons on the right-hand side. I'm visiting that website for the n^th time because I need to quickly refresh my regex to accomplish a task.

I don't remember why, but I learned regexes maybe my sophomore year of college because it seemed like such an interesting thing that I might need some day? I also started to learn SQL about the same time. Those feel like herculean efforts so foreign now because I've been using them for so long.

One day at my second job out of college, I wrote a quick regex for something, and one of my more senior colleagues looked at me like I was a wizard. It was amazing to be able to get some street cred for that. To me, it validated the effort I went through previously.

I'm trying to learn about SAT/SMT solvers now. Not because I have a pressing need for them but because it's a completely foreign thing that - who knows? - maybe I'll be able to put to good use today. The problem with SAT/SMT is that there are nowhere near the clear learning resources compared with, say, regexes.

I like the way you teach them through interactive training. I have only spent a few minutes and really understand why I have copied and pasted that thing from Stackoverflow. If you start with a challenge from a complex log file, it makes it easy to learn by providing extra motivation.

Some exercises are formulated in a weird way. It takes a while to understand what they want from me.

> Write the expression using curly brackets {} to select the numbers from 0 to 9 in the text that is at least between 1 and 4.

Great discussion. I stumbled upon a comprehensive tutorial from 2014 just yesterday: "Learn regex in about 55 minutes": https://qntm.org/re_en

Discussed on HN as well (78 comments): https://news.ycombinator.com/item?id=7370622

Regular expressions are one of the first things I remember being taught in my intro CS class that really broadened my perspective on programming. Probably helped that the main language being taught was Perl, and my previous programming experience was BASIC ... Expectations might have been set low. One of those tools that's often over-applied, but invaluable in so many other situations.

At first glance this looks like software that’s a bit dated and nothing special, but try to use it just once and tell me if you don’t agree it’s freaking awesome when you want to create an expression or get a feel for exactly how it’s working.

I have no affiliation with these guys, I’ve just been impressed over the years


The website is broken when https://www.googletagmanager.com/ is blocked (on any decent uBlock Origin setup). The webmaster might want to fix this ASAP, I don't know why this kind of script is "mandatory" in modern web design...

cool site and idea, here's 3 site UI suggestions

there's a bug when entering text into the regex area: new padded elements show up, which makes text on the page shift, e.g. https://i.imgur.com/AFBccIn.gif

the line height in the subtitle area could be increased so the text isn't so cramped ( https://i.imgur.com/HReADcS.png )

the background color of the entire page should be darker (or use some other solution) to make the <code> looking text more distinct as code text. ( dark grey on dark blue doesn't really stand out https://i.imgur.com/8Nt9YNl.png )

It would also be nice to not have to scroll the page to click the next and previous links on each learning step (pixel 5, msedge browser). 100vh maybe?

I found logstash's Grok a good tool because it is composable and you can reuse complex expressions with simpler naming. There is also a knowledge base of good regex you can use without reinventing the wheel. Rubular.com is also a great tool for working out regex matches.

This looks nice but is telling you what to do, instead of giving you problems to solve. I don't think this is a good method of teaching.

Edit: After lesson 8 you start to get problems to solve

Edit 2: I take it back, this is the first time I've understood how lookarounds work. Great stuff!

That's how it starts out, then it tests you with some problems to solve based on what you learned before.

My biggest challenge with RegEx isn't the tool per se, it's the frequency I need to use it. If it were more often I'm sure more would stick. But it's such a specialized tool and the need for the speciality is rare-ish. At least for me.

What made it stick for me: a compiler course. NFAs, DFAs, lexers, parsers, semantic rule checking, etc.

Btw: Lexer-less derivation parsers are interesting because of how simple they are to write and maintain. IIRCBIMW, they are neither LL nor LR.

I like this, but the writing is in desperate need of improvement.

> The basic matcher is to type as is to choose a character or word. For example, to select the word curious in the text, type in the same way.

Regex is one of those things that I don't use often enough to really remember the syntax, so every time I end up needing it I have to relearn it from the ground up.

In terms of syntax, there are quite a few regex builder websites which can help you create and validate a regex pattern for a particular syntax.

The only way I was able to really feel like I understood regular expressions was implementing a regex engine myself.

Thank you for this. I have been trying to learn regex for a while, but this makes it easier.

started by adding "ok" in the first lesson... lock just shakes and when clicking the answer, it shows the exact same thing that's already in the box :/

You need to press enter/return to advance.

Actually, it needs to be uppercase "OK". I typed "ok" also, and scratched my head for a while. I get it should be obvious, but my brain just didn't get it right away.

Really doesn't help that the capitalized "OK" in "Start by typing OK in the RegEx field" is in a smaller font.

Now, this is the time to learn Regex the proper way.

highly recommend Mastering Regular Expressions

Are you referring to a book or online resource?

He or she is probably referring to Jeffrey Friedl's classic tome: http://regex.info/book.html
