
You Should Learn Regex - quotable_cow
https://blog.patricktriest.com/you-should-learn-regex/
======
submeta
When I used to code in Perl (more than ten years ago) I tended to solve many
many things using regexes. I tried to reduce many programming challenges into
transformations that I could do via regexes. Why? Because regular expressions
are first class citizens in Perl. And Perl made me learn and use them.
(Reminds me of an APL programmer that saw vectors everywhere.)

When I replaced Perl with Python in my toolbox I abandoned regexes altogether
because I didn't know too many Python programmers using regexes.

Now a decade later I am realizing how often I do need regexes in my day to day
tasks (using them in bash commands, in my editor, and even in Python
sometimes).

~~~
philipov
Regex patterns might not be first class citizens in python, but they're easy
enough to use, and I think they're ultimately more powerful than in Perl. I
don't remember being able to use a callback to calculate my replace string as
a function of the match results when doing substitution in Perl. It's super
easy to do tokenization with python regexes. We lose first class patterns, but
gain first class functions and classes :)
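
A minimal sketch of the two things mentioned above, using only the standard `re` module: a function passed to `re.sub` as the replacement, and a small tokenizer built from named groups. The helper names (`double_number`, `double_all_numbers`, `tokenize`) and the token categories are made up for illustration.

```python
import re


def double_number(match: re.Match) -> str:
    # The replacement string is computed from the match object itself.
    return str(int(match.group(0)) * 2)


def double_all_numbers(text: str) -> str:
    # re.sub accepts a callable in place of a replacement string.
    return re.sub(r"\d+", double_number, text)


# One alternative per token type; the group name doubles as the token label.
TOKEN = re.compile(
    r"(?P<NUM>\d+)|(?P<NAME>[A-Za-z_]\w*)|(?P<OP>[+\-*/=])|(?P<WS>\s+)"
)


def tokenize(text: str) -> list:
    # m.lastgroup is the name of the alternative that matched.
    # Whitespace tokens are dropped; unmatched characters are skipped silently.
    return [
        (m.lastgroup, m.group())
        for m in TOKEN.finditer(text)
        if m.lastgroup != "WS"
    ]


print(double_all_numbers("3 apples and 12 pears"))  # 6 apples and 24 pears
print(tokenize("x = 42 + y"))
```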

~~~
kqr
Not to mention that writing a composable first-class API to construct regexes
should be fairly elementary in Python.

~~~
philipov
oh what a good idea... wait, surely someone's already built this, right? like
sqlalchemy's expression language, but for regex. If no one's built that yet,
I'd build that.
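
For a rough idea of what such an expression language could look like, here is a toy sketch. Everything in it (`Rx`, `lit`, `many`, `optional`, `DIGIT`) is hypothetical and invented for this example, not an existing library; it only wraps pattern strings and hands the result to the standard `re` module.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Rx:
    """Hypothetical wrapper around a regex fragment (not a real library)."""

    pattern: str

    def __add__(self, other: "Rx") -> "Rx":
        # Concatenation; non-capturing groups keep operator precedence intact.
        return Rx(f"(?:{self.pattern})(?:{other.pattern})")

    def __or__(self, other: "Rx") -> "Rx":
        # Alternation between two fragments.
        return Rx(f"(?:{self.pattern})|(?:{other.pattern})")

    def many(self) -> "Rx":
        # One or more repetitions.
        return Rx(f"(?:{self.pattern})+")

    def optional(self) -> "Rx":
        # Zero or one occurrence.
        return Rx(f"(?:{self.pattern})?")

    def compile(self) -> re.Pattern:
        return re.compile(self.pattern)


def lit(text: str) -> Rx:
    # Literal text, escaped so metacharacters lose their special meaning.
    return Rx(re.escape(text))


DIGIT = Rx(r"\d")

# Fragments compose like ordinary Python values.
signed_int = (lit("+") | lit("-")).optional() + DIGIT.many()
decimal = (signed_int + (lit(".") + DIGIT.many()).optional()).compile()

print(bool(decimal.fullmatch("-12.5")))  # True
print(bool(decimal.fullmatch("1.")))     # False
```

Wrapping every fragment in `(?:...)` makes the generated pattern noisy, but it means composition never changes what a sub-expression matches.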

------
tmdvs
One website I found extremely useful when initially learning regex was
[http://regexr.com](http://regexr.com). It's got a great pattern builder, and
hovering over each component explains what it's actually doing.

------
jlg23
For those who prefer a more in-depth introduction than a blog post: Jeffrey
Friedl's "Mastering Regular Expressions"[1] covers the theory as well while
still being very pleasant to read (I only know the first edition, apparently
there are more recent ones).

[1]
[http://shop.oreilly.com/product/9781565922570.do](http://shop.oreilly.com/product/9781565922570.do)

------
asicsp
haven't read/processed the article fully, some nitpicks:

* `\b([01]?[0-9]|2[0-3]):([0-5]\d)\b` you are using both [0-9] and \d, is that intentional?

* `cat test.txt | grep -E "^[0-9]+$"` UUoC and double quotes subjected to shell interpretation, should be `grep -E '^[0-9]+$' test.txt` or `grep -xE '[0-9]+' test.txt`

* `(?i)` won't work with `grep -E` (or at least not for me on GNU grep). features of BRE/ERE and PCRE like regex are very different - see [https://unix.stackexchange.com/questions/119905/why-does-my-...](https://unix.stackexchange.com/questions/119905/why-does-my-regular-expression-work-in-x-but-not-in-y)

* avoid parsing ls - [https://unix.stackexchange.com/questions/128985/why-not-pars...](https://unix.stackexchange.com/questions/128985/why-not-parse-ls) , use glob/find

* `.<star>?` again, this regex feature is not available with BRE/ERE, your example just happens to work. you can check it with `echo 'abc foo 123 bar 123' | perl -pe 's/foo.<star>?123//'` and `echo 'abc foo 123 bar 123' | sed -E 's/foo.<star>?123//'` (using <star> to avoid formatting issues)
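
To see the greedy vs. lazy difference from the last point in an engine that does support lazy quantifiers, here is a small sketch using Python's `re` module (Perl-style syntax) on the same input string. The variable names are only for illustration.

```python
import re

text = "abc foo 123 bar 123"

# Lazy: stop at the first "123" after "foo".
lazy = re.sub(r"foo.*?123", "", text)

# Greedy: run to the last "123" on the line.
greedy = re.sub(r"foo.*123", "", text)

# Inline flags such as (?i) are also part of this Perl-style syntax.
case_insensitive = re.search(r"(?i)FOO", text)

print(repr(lazy))    # 'abc  bar 123'
print(repr(greedy))  # 'abc '
print(case_insensitive.group(0))  # foo
```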

~~~
quotable_cow
Thanks, that's all very useful feedback.

~~~
asicsp
for the ls parsing example, you can do

shopt -s nocaseglob

ls ~/Downloads/*.{png,jpg,jpeg,gif,webp}

\----

for command line tools (grep/sed/awk/sort/etc), you can refer to my ongoing
project ([https://github.com/learnbyexample/Command-line-text-
processi...](https://github.com/learnbyexample/Command-line-text-processing))
as a resource :)

------
cies
A very extensive article, and it shows how powerful RXes are.

It makes a point, in "8.3 - For Problems That Don't Require Regex" that RXes
should not be overused when a simpler solution exists.

But there are also problems that are better solved without RXes, but with a
"parser" instead. The article mentions parsers, but fails to mention that
self-made parsers also provide solutions in the same problem space as RXes. Some
communities (like the Haskell community) prefer parsers as they can be
inspected, tested and typed. Parsers provide a more robust and hackable
solution, that is possibly faster than the equivalent RX.

Here is a short tutorial for a popular parser library in Haskell, to get an idea
of what it's like:

[http://akashagrawal.me/beginners-guide-to-
megaparsec](http://akashagrawal.me/beginners-guide-to-megaparsec)

------
lowmagnet
> From validating email addresses

Guys, no. Send an email. [https://davidcel.is/posts/stop-validating-email-
addresses-wi...](https://davidcel.is/posts/stop-validating-email-addresses-
with-regex/)

~~~
Invisig0th
The article you link to addresses just the one use case where you are
registering a new user. Email addresses in business applications are just as
often NOT the address of your user. Sales contacts, managers, points of
contact, customer support addresses, etc. -- none of these should ever be
validated by sending an e-mail. So you still need to validate the hard way in
plenty of scenarios.

~~~
kqr
...but unless you actually send an email it is still just as unvalidated as it
was before. Regexes aren't a tool to determine that a string refers to a
mailbox capable of receiving mail.

If you use regexes (or any other method that does not send emails) all
you're saying is that you don't actually care whether or not the string points
to a recipient (much less the correct recipient).

And don't get me wrong. It is absolutely okay to not care whether or not the
string refers to a correct recipient. Most places that make me write my email
have no business caring about it. But please also then make the field
optional.

~~~
Invisig0th
Regex validation cannot verify that an email address is functioning any more
than it can verify that a phone number is functioning. But if an address or a
number passes formatting checks, one can indeed consider that data
"validated". That's what that term means in the business software development
industry. We're verifying if the data entered COULD be correct, not double-
checking that it IS correct.

You are correct to say that those emails may still bounce, and the phone calls
may also not go through. We completely understand that. For this reason, in
very specific situations (like registering a new user), we do take that extra
step to make sure the communication channel actually works. But there are
plenty of situations where that makes absolutely no sense, and/or adds very
little value for the cost. Knowing the difference between these two very
different use cases certainly does not indicate that these people "don't
actually care" about the accuracy of their data.
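
As a rough sketch of the kind of "could this be correct" format check described above: the pattern below is a deliberately loose assumption (something, an `@`, something with a dot, no whitespace) and not a full address grammar, and `looks_like_email` is a made-up helper name. It says nothing about whether mail to the address would be delivered.

```python
import re

# Deliberately loose, hypothetical pattern: local@domain.tld with no
# whitespace and exactly one "@". It will still accept many strings
# that are not deliverable addresses.
LOOSE_EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def looks_like_email(value: str) -> bool:
    # fullmatch requires the whole string to fit the pattern.
    return LOOSE_EMAIL.fullmatch(value) is not None


print(looks_like_email("sales@example.com"))  # True
print(looks_like_email("not an email"))       # False
```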

~~~
kqr
And what I'm saying is that either the system _relies_ on functioning email
addresses, in which case they need to be ensured to work anyway. ...or the
system does not rely on working email addresses, in which case drop the
pretense, make it an optional field and people's days will suck less.

Besides the practical issues mentioned, it's this no-brained "why the heck
not" collection of personal data I'm turning against. Either you need the
stuff and then you have to work for it, or you don't and then you have nothing
to do with it.

------
seven800
I created a little web app to generate strings which match a given regex:
[http://regexicon.com](http://regexicon.com)

It's useful to ensure you're not matching things that you don't mean to, which
is a problem I often see when code reviewing regex.

------
souriguha
OP, your link to [http://regex101.com/](http://regex101.com/) is broken. It's
my go-to web-app too.

~~~
quotable_cow
Thanks! It's fixed now.

~~~
Sean1708
Also the link in this paragraph:

> The source code for the examples in this tutorial can be found at the Github
> repository here - [https://github.com/triestpa/You-Should-Learn-
> Rege](https://github.com/triestpa/You-Should-Learn-Rege)

has a `href` of `""` which is making it point to the blog post itself.

------
Tepix
Regular expressions are super powerful and a great tool to have at your
disposal. The best resource on the topic is the book "Mastering regular
expressions" published by O'Reilly. Understanding the inner workings of a
regular expression engine will improve your regular expressions. If you use
regular expressions a lot, mastering them will be worth it.

------
yenwel
Yea learn regex, so you know when not to use it. For parsing non regular
languages like XML...
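
A small illustration of why nesting is the sticking point: checking arbitrarily deep balanced brackets needs a stack (or counter), which a classical regular expression does not have. This is a toy sketch with a made-up `is_balanced` helper, not an XML parser.

```python
# Maps each closing bracket to the opening bracket it must pair with.
PAIRS = {")": "(", "]": "[", "}": "{"}


def is_balanced(text: str) -> bool:
    stack = []
    for ch in text:
        if ch in PAIRS.values():
            # Opening bracket: remember it until its partner shows up.
            stack.append(ch)
        elif ch in PAIRS:
            # Closing bracket: must match the most recent unclosed opener.
            if not stack or stack.pop() != PAIRS[ch]:
                return False
    # Anything left on the stack was never closed.
    return not stack


print(is_balanced("(a [b {c} d] e)"))  # True
print(is_balanced("(a [b) c]"))        # False
```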

------
oldoverholt
The OpenBSD re_format man page is a quick read and super clear and helpful:
[https://man.openbsd.org/re_format](https://man.openbsd.org/re_format)

------
not_kurt_godel
This sounds like an article I would have written in college. In my experience,
if you're using regex often enough to remember how to write patterns that are
sufficiently complex to actually be useful beyond short 'pre-processing'
snippets, you're overusing regex. (Or alternatively, your memory is
significantly better and/or less divided than mine.)

------
habeebtc
"Now you have two problems" . ;)

------
henesy
The Sam language and editor may be of interest:

[http://sam.cat-v.org](http://sam.cat-v.org)

[http://doc.cat-v.org/plan_9/4th_edition/papers/sam](http://doc.cat-v.org/plan_9/4th_edition/papers/sam)

------
le-mark
Kleene Algebra, generalized regular expressions, for the adventurous [1].

[1]
[https://en.wikipedia.org/wiki/Kleene_algebra](https://en.wikipedia.org/wiki/Kleene_algebra)

------
Endy
I'm not a programmer, but I use Notepad++ and mp3splt, because I have uses for
them. This might actually help me use the search functions more effectively.

------
ndr
Some years ago I went through Regular Expressions Cookbook, by Steven
Levithan and Jan Goyvaerts, and can't recommend it enough.

------
j_m_b
...but..but.. then you'll have two problems!! right guys?

------
fredrb
Section 0:

`[0-9] - Matches any digit between 0 and -`

Mistyped the `9` as a `-`. :-)

~~~
quotable_cow
Thanks, good catch!

------
papey
Why the fuck on earth do people cat and pipe stuff to grep? grep can take files
from args!

~~~
yorwba
I cat files into programs that can take their inputs both from stdin and a
given file, although I know people call that "useless". I do it because it
makes my pipelines more uniform, by keeping the direction of data flow from
left to right. It also makes it easier to insert additional processing steps,
or changing the input from a file to the output of a program. Maybe I'm a fool
for writing Bash like Forth, but I prefer it that way.

~~~
asicsp
you can still do `< test.txt grep ...`

see
[https://en.wikipedia.org/wiki/Cat_(Unix)#UUOC_.28Useless_Use...](https://en.wikipedia.org/wiki/Cat_\(Unix\)#UUOC_.28Useless_Use_Of_Cat.29)

>Beyond other benefits, the input redirection forms allow command to perform
random access on the file, whereas the cat examples do not. This is because
the redirection form opens the file as the stdin file descriptor which command
can fully access, while the cat form simply provides the data as a stream of
bytes.

commands like `sort` are optimized to handle large input files

and how would you do `grep -l 'foo' *.txt` if you use `cat`?

or `awk 'NR==FNR{a[$1]; next} $1 in a' file1 file2`

or `grep -Fxf file1 file2` and so on....

