
Learn Regex the Easy Way - shubhamjain
https://github.com/zeeshanu/learn-regex
======
maddyboo
My biggest issue with regular expressions is remembering the exact syntax of
each of the different common regex engines. Javascript, Perl, `grep`, `grep
-E`, vim, `awk`, `sed`, etc...

Each one seems to have slightly different syntax, requires different
characters to be escaped, and has different defaults (is global search enabled
by default? Multiline? What about case sensitivity?). Some don't support
certain lookarounds, grouping works differently, and so on.
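
To make the "defaults" question concrete for one engine, here is a minimal sketch using Python's standard `re` module. The sample text and variable names are made up for illustration, and other engines will answer these questions differently.

```python
import re

text = "Alpha\nbeta\nALPHA"  # illustrative sample input

# Matching is case-sensitive unless IGNORECASE is passed.
case_default = re.findall(r"alpha", text)
case_ignored = re.findall(r"alpha", text, flags=re.IGNORECASE)

# ^ anchors only at the start of the string unless MULTILINE is passed.
anchors_default = re.findall(r"^\w+", text)
anchors_multiline = re.findall(r"^\w+", text, flags=re.MULTILINE)

# There is no "global" flag: search() stops at the first match,
# while findall() and finditer() return every match.
first_only = re.search(r"\w+", text).group()

print(case_default, case_ignored)
print(anchors_default, anchors_multiline)
print(first_only)
```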

~~~
sglane
There are three types of regex as far as I know: basic (aka GNU), extended and
PERL. Grep uses GNU as the name implies. egrep or grep -E uses extended. PERL
is used elsewhere like JavaScript. Typically you'll see pcre which is the
library for Perl Compatible Regular Expressions.

~~~
jwilk
The name "grep" is unrelated to GNU. According to The Jargon File, the
etymology is:

 _from the qed /ed editor idiom g/re/p, where re stands for a regular
expression, to Globally search for the Regular Expression and Print the lines
containing matches to it_

In fact, grep predates the GNU project by almost a decade.

Also, it's the first time I hear about "GNU" regexps.

~~~
Hello71
a more accurate term would be "BRE with GNU extensions", e.g. \(\)

~~~
jwilk
\(\) are part of BRE syntax. They are not GNU extensions.

Source:
[http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html#tag...](http://pubs.opengroup.org/onlinepubs/7908799/xbd/re.html#tag_007_003_006)

~~~
Hello71
er, I meant \{\}.

~~~
jwilk
\{n\}, \{n,\} and \{n,m\} are all in POSIX.

\{,m\} is a (pretty obscure) GNU extension.
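
For comparison, a small sketch of the same interval forms in Python's `re` module, which writes the braces unescaped (ERE style) and also happens to accept the `{,m}` form. The helper name `matches` is only for illustration. This shows Python's behaviour, not grep's.

```python
import re

def matches(pattern: str, s: str) -> bool:
    """True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

exactly_two = matches(r"a{2}", "aa")        # {n}: exactly n
at_least_two = matches(r"a{2,}", "aaaaa")   # {n,}: n or more
too_many = matches(r"a{2,3}", "aaaa")       # {n,m}: between n and m
up_to_two_empty = matches(r"a{,2}", "")     # {,m}: zero up to m in Python
up_to_two_over = matches(r"a{,2}", "aaa")

print(exactly_two, at_least_two, too_many, up_to_two_empty, up_to_two_over)
```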

~~~
Hello71
hm, I could've sworn it wasn't.
[http://www.regular-expressions.info/gnu.html](http://www.regular-expressions.info/gnu.html)
says that \?, \+, and \| are GNU extensions though.

------
clement75009
For me, the best Regex resource is still
[http://regexr.com](http://regexr.com)

It explains what each character does just by hovering over a regex. Best tool
to learn or to fine tune your regular expression (with testing included).

~~~
chrisan
This is a paid app, but the debugger on this has been my favorite

[https://www.regexbuddy.com/](https://www.regexbuddy.com/)

You can even step through the matching process and see how your matches are
made

~~~
squeaky-clean
Love RegexBuddy. That feature along with the "use" tab which generates code
for whatever language you select. I don't have to remember the specifics in
all the languages I work in. I just select from the dropdowns, for example,
"JavaScript (Chrome)" then "Use regex object to get the part of a string
matched by a numbered group". Replace the placeholder variable names, and
you're good to go!

It will also do things like warn you if you use named groups if your selected
language doesn't support them, and the "Use" dropdown won't provide that
option.

I really wish it wasn't Windows only.

~~~
chrisan
Hah wow... I've had this app for over a decade and never noticed that
feature... right next to the Debug button which I have used numerous times on
really gnarly regexes

------
reuven
I've been teaching regular expressions for years, and offer a free online
e-mail course on the subject
([http://RegexpCrashCourse.com/](http://RegexpCrashCourse.com/)).

This site is a very nice summary of regexp syntax and is written well -- _but_
it's missing two crucial pieces that help people learn: Examples and
exercises. Without practice, there's no way that people can remember the
syntax.

~~~
darkstar999
It isn't missing those. Every section links to regex101.com prefilled, ready
for experimentation.

~~~
reuven
I'll revise my criticism, then: There are some examples, but not enough. And
the links to an interactive regexp system are very smart and nice.

------
modalduality
Good article (didn't realize there were other kinds of lookaround), but maybe
the bottom should link to well-tested standards-based regexes instead.

    
    
        URL: ^(((http|https|ftp):\/\/)?([[a-zA-Z0-9]\-\.])+(\.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]\/+=%&_\.~?\-]*))*$
    

I recently encountered a case where a URL had an underscore at the end of a
subdomain name. It seems underscores are okay anywhere else, but while my
friend on Windows was able to load the website, I wasn't (on Linux) using
Firefox, curl, remote screenshot service which presumably ran Linux etc.
According to various RFCs, they should be okay anywhere within the subdomain
name.

Has anyone encountered this behavior? Couldn't find anything on the internet;
maybe it's just my computer?

~~~
keeperofdakeys
It seems to mostly come down to differences in how things are defined. DNS
itself can handle almost arbitrary data
[https://tools.ietf.org/html/rfc2181#section-11](https://tools.ietf.org/html/rfc2181#section-11),
while an Internet Hostname was defined to be more strict
[https://tools.ietf.org/html/rfc1123#section-2](https://tools.ietf.org/html/rfc1123#section-2).
The same issue also exists with dashes at the end of domain components.
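
As a rough sketch of the stricter hostname definition (letters, digits and hyphens only, with no hyphen at the start or end of a label, and at most 63 characters per label), something like the following could be used. The names `LABEL` and `is_strict_hostname` are made up here, and this is an approximation of the rule, not what glibc or any browser actually runs.

```python
import re

# One label: starts and ends with a letter or digit, hyphens only inside,
# 1 to 63 characters in total. Underscores are not allowed.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")

def is_strict_hostname(name: str) -> bool:
    """Check every dot-separated label against the strict label rule."""
    return all(LABEL.fullmatch(label) for label in name.split("."))

print(is_strict_hostname("downloads.ourdomain.com"))
print(is_strict_hostname("downloads_.ourdomain.com"))
print(is_strict_hostname("downloads-.ourdomain.com"))
```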

I'm not enough of a history boffin to know how Microsoft came to support it
differently (perhaps something from the Netbios and NT era). At this point in
time though, I don't see either party changing their default validations to
agree on a single definition.

Edit: If you're curious, this appears to be the first glibc commit limiting
dashes at the end of URLs
[https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fa0bc...](https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=fa0bc87c32d02cd81ec4d0ae00e0d943c683e6e1#patch31).
I don't know about BSD libc or Windows, however.

~~~
tzs
Wait a second...does this imply that if I put downloads that should only be of
interest to our Windows customers on a server named something like
downloads_.ourdomain.com, it might keep out all those annoying bots that
ignore robots.txt and make a lot of noise in my logs? I'm guessing that most
of the bots are not running on Windows.

~~~
keeperofdakeys
That's a pretty bad idea, you shouldn't rely on this kind of stuff.

If there are people running OSX or Linux that want Windows downloads, or
someone is behind a captive portal or proxy (like squid), they probably won't
be able to reach it anymore.

If you have a real problem with bots, I'd look at what IPs they are coming
from, and how often they try to connect. Something like IP blacklisting, or
fail2ban might work for your use case.

------
dayvid
[https://regexone.com/](https://regexone.com/) helped me finally learn Regex
in 2-3 hours.

It's a step-by-step interactive site. One of the best educational programming
sites I've been to.

Testing Regex is a lot easier when you have the fundamentals down and there's
a million resources to test Regex.

------
frou_dh
I learned regex the "Ambient" way.

i.e. Encountering them here, there and everywhere. Then one day realising you
have a good knowledge of the subject without ever having set out to learn it.

------
reificator
I spent some time reading some resource or another on how regexes work, but
the vast majority of my learning has been trying things in
[https://regex101.com/](https://regex101.com/) and seeing if they do what I
want. The breakdown on the side of the page is especially helpful.

------
retox
Waiting for the additional "Read someone else's Regex the easy way". I'm not
holding my breath :)

Agree with others in that RegexBuddy is indispensable for a Windows dev
learning this magic stuff.

Some useful and interesting regex developments coming in the next version of
JavaScript. Support for international text and (bleh) emoji incoming.

------
Groxx
And when you think you've learned regex, learn that you haven't:
[http://fent.github.io/randexp.js](http://fent.github.io/randexp.js) (a regex
"reverser" of sorts) [1]

Seriously. Test _every_ non-trivial regex with something like this, you'll
probably be surprised at how permissive most regexes are.

Regexes are great. They're super-concise and perform amazingly well. But
they're one of the biggest footguns I know of. Treat them as such and you'll
probably do fine.

\---

[1] for instance, the URL regex they use is incorrect, and it's super obvious
when you plug it into that site:

    
    
        ^(((http|https|ftp):\/\/)?([[a-zA-Z0-9]\-\.])+(\.)([[a-zA-Z0-9]]){2,4}([[a-zA-Z0-9]\/+=%&_\.~?\-]*))*$
    

`[[a-zA-Z0-9]\-\.]` you can't nest character sets like that. So this matches
the letters "[]-." as well as all a-z,A-Z,0-9 ranges.
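
One way to check how a particular engine reads that fragment is to compile it and probe it. The sketch below does this with Python's `re` (the variable names are illustrative). In Python, at least, the set ends at the first `]`, so the fragment parses as one character from `[`, a-z, A-Z, 0-9, followed by the literal text `-.]`. Other engines may parse it differently.

```python
import re
import warnings

fragment = r"[[a-zA-Z0-9]\-\.]"

# Python warns about a "possible nested set" here; silence it for the demo.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", FutureWarning)
    pattern = re.compile(fragment)

letter_then_literal = pattern.fullmatch("a-.]") is not None
bracket_then_literal = pattern.fullmatch("[-.]") is not None
single_letter = pattern.fullmatch("a") is not None
single_dot = pattern.fullmatch(".") is not None

print(letter_then_literal, bracket_then_literal, single_letter, single_dot)
```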

------
jules
I didn't truly understand regular expressions until I saw how they are
executed. There are simple algorithms for executing them, so that might not be
such a bad way to teach.
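
As an illustration of how small such an algorithm can be, here is a sketch of a backtracking matcher in the style of the well-known Kernighan and Pike example. It handles only literal characters, `.`, `*`, `^` and `$`, and is not how production engines work.

```python
def match(pattern: str, text: str) -> bool:
    """True if pattern matches anywhere in text."""
    if pattern.startswith("^"):
        return match_here(pattern[1:], text)
    # Try every starting position, including the empty tail.
    return any(match_here(pattern, text[i:]) for i in range(len(text) + 1))

def match_here(pattern: str, text: str) -> bool:
    """True if pattern matches at the beginning of text."""
    if not pattern:
        return True
    if len(pattern) >= 2 and pattern[1] == "*":
        return match_star(pattern[0], pattern[2:], text)
    if pattern == "$":
        return text == ""
    if text and pattern[0] in (".", text[0]):
        return match_here(pattern[1:], text[1:])
    return False

def match_star(c: str, pattern: str, text: str) -> bool:
    """Match zero or more occurrences of c, then the rest of pattern."""
    i = 0
    while True:
        if match_here(pattern, text[i:]):
            return True
        if i < len(text) and c in (".", text[i]):
            i += 1
        else:
            return False

print(match("a*b", "aaab"), match("^ab$", "abc"), match("a.c", "xxabcxx"))
```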

~~~
hackermailman
I learned from the book _The Unix Programming Environment_ but I also didn't
truly understand regexp until I read how they were executed, in _Programming
in Standard ML_ the first chapter shows how to implement a complete
package/parser for regexp
[http://www.cs.cmu.edu/~rwh/isml/book.pdf](http://www.cs.cmu.edu/~rwh/isml/book.pdf)

------
Willamin
The lack of readability of regex makes me wonder if there isn't a better way.
I've seen Elm's parser which introduces a few neat concepts like parser
pipelines.
[https://github.com/elm-tools/parser](https://github.com/elm-tools/parser)

~~~
mclehman
Have you looked at Perl 6 at all? Whitespace in regexen is insignificant if
not quoted, so not only can you add a little space between sections, you can
split a regex over several lines and add comments throughout.
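
A similar effect is available in some other engines through an opt-in flag. As a sketch, Python's `re.VERBOSE` ignores unescaped whitespace and allows `#` comments inside the pattern. The date format and group names below are just an example.

```python
import re

DATE = re.compile(
    r"""
    (?P<year>\d{4})    # four-digit year
    -
    (?P<month>\d{2})   # two-digit month
    -
    (?P<day>\d{2})     # two-digit day
    """,
    re.VERBOSE,
)

m = DATE.fullmatch("2017-08-01")
print(m.group("year"), m.group("month"), m.group("day"))
```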

It also has first-class grammars, so you're less tempted to reach for regex
when something more powerful would help.

------
gregmac
For some people Regex Golf [1] might be an interesting way to learn. You are
actually building increasingly complex regex as you go, and can just look up
bits of syntax you don't know as needed.

[1] [https://alf.nu/RegexGolf](https://alf.nu/RegexGolf)

------
crncosta
I really enjoy this type of tutorial format, concise and easy to follow.
Thanks for taking the time to produce it.

------
chenster
A picture is worth a thousand words. How about a visual regex tester -
[http://emailregex.com/regex-visual-tester/#a%5Cbc%5Cd*](http://emailregex.com/regex-visual-tester/#a%5Cbc%5Cd*)

------
VeejayRampay
Very good post. It has all the important information, provides clear examples,
doesn't try to get too fancy or showboat. Well done.

------
jwilk
Previously:
[https://news.ycombinator.com/item?id=14846506](https://news.ycombinator.com/item?id=14846506)

I'm afraid not much has been improved since then.

This is _not_ a good learning source.

------
gmac
For an even-more-beginner's guide, see the slides to a session I teach to Econ
postgrads once a year[1].

These introduce the metacharacters gradually, using a task-based approach. We
start by finding street addresses, per
[https://xkcd.com/208/](https://xkcd.com/208/).

[1]
[http://mackerron.com/text/text-slides.pdf](http://mackerron.com/text/text-slides.pdf)
(page 19 onwards) with
supporting resources at
[http://mackerron.com/text/](http://mackerron.com/text/)

------
chenster
For email regular expressions, there's
[http://emailregex.com](http://emailregex.com)

------
j05huaNathaniel
Could use some work explaining capture groups

~~~
com2kid
Capture groups are where I get 99% of value from regexes. Being able to quickly
transform data is where I find regexes perform best. Just matching is not
something I have to do all that often.
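
A minimal sketch of that kind of transformation, using Python's `re.sub` with numbered capture groups to swap "Last, First" into "First Last". The sample data is invented for the example.

```python
import re

names = "Lovelace, Ada\nHopper, Grace"  # illustrative input

# Group 1 captures the surname, group 2 the given name;
# the replacement string refers back to them as \2 and \1.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", names)

print(swapped)
```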

