You Should Learn Regex (patricktriest.com)
90 points by quotable_cow 9 days ago | 47 comments

When I used to code in Perl (more than ten years ago) I tended to solve many, many things using regexes. I tried to reduce many programming challenges into transformations that I could do via regexes. Why? Because regular expressions are first-class citizens in Perl. And Perl made me learn and use them. (Reminds me of an APL programmer who saw vectors everywhere.)

When I replaced Perl with Python in my toolbox I abandoned regexes altogether because I didn't know too many Python programmers using regexes.

Now, a decade later, I'm realizing how often I need regexes in my day-to-day tasks (using them in bash commands, in my editor, and even in Python sometimes).


I found the same thing. When I was writing Perl half the point was using regex. I inherited a site about 5 years ago that was a huge Perl site built entirely on text files and regex. The developer had basically built a database out of text files. He had directories for indexes, associations, etc.

It wasn't pretty to look at, but to this day it's still the fastest site I've ever worked on.


How did he achieve transactionality?

Anywhere that he'd need transaction safety in the system, everything was contained in a single file.

There were other jobs that could rebuild indexes by scanning, repair associations, or break things down into parts for use in other areas of the site, but the areas that needed safety were lumped together as a single record (similar to the NoSQL approach, honestly).

It wasn't a general purpose setup, but it worked for his purposes.



Great read. Thanks for sharing.

Regex patterns might not be first-class citizens in Python, but they're easy enough to use, and I think they're ultimately more powerful than in Perl. I don't remember being able to use a callback to calculate my replace string as a function of the match results when doing substitution in Perl. It's super easy to do tokenization with Python regexes. We lose first-class patterns, but gain first-class functions and classes :)
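
For concreteness, here is a rough Python sketch of that callback style (the running-total replacement is purely illustrative):

    import re

    def running_total_sub(text):
        """Replace each number with the running total so far (illustrative)."""
        total = 0
        def repl(match):
            nonlocal total
            total += int(match.group(0))
            return str(total)
        # re.sub accepts a callable, which is called with each match object
        return re.sub(r'\d+', repl, text)

    print(running_total_sub('13 37 123 42'))   # -> 13 50 173 215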

> I don't remember being able to use a callback to calculate my replace string as a function of the match results when doing substitution in Perl.

That's easy with the `e` (eval) flag.

    say '13 37 123 42' =~ s
        { ( \d+ )           # get all the numbers }
        { my $total += $+;  # replace with running total }egrx;
    __END__
    output is 13 50 173 215

> I think they're ultimately more powerful than in Perl

No. Perl regexes are more powerful than Python's because

1. the programmer can pick from a large selection of delimiters

2. the full-featured engine is built-in and ready to use, but in Python you need to install the third-party "regex" module because the built-in one named "re" is hopelessly dyd¹

3. you may enable verbose (readable) character classes

4. the regex subsystem is pluggable through a common interface, and within a lexical scope you can switch at run time to more restricted implementations that have fewer features but better performance in some cases: https://metacpan.org/search?q=re%3A%3Aengine%3A%3A

5. you can embed arbitrary code with (?{...}) and (??{...})

> We lose first class patterns, but gain first class functions and classes :)

Perl has had first-class functions since 1993.

You love having first-class classes in Python, but do not realise the downside. Python's one true meta object system (with its ageing design from the last millennium) is flawed: horizontal composition of methods with the same name does not emit a warning. Since the system is baked into the language, it cannot be changed. Contrast with Perl, where the language only provides some primitives on which to build a meta object system (incl. first-class classes). This enables bugs to be fixed, and the competition among several implementations on CPAN breeds excellence.

¹ https://i.imgur.com/66z4KDA.jpg


Not to mention that writing a composable first-class API to construct regexes should be fairly elementary in Python.

oh what a good idea... wait, surely someone's already built this, right? Like SQLAlchemy's expression language, but for regex. If no one's built that yet, I'd build that.
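
A toy sketch of what that might look like (the class and method names here are made up, not an existing library):

    import re

    class P:
        """Hypothetical tiny pattern-builder: compose pieces, compile at the end."""
        def __init__(self, pattern):
            self.pattern = pattern

        def __add__(self, other):            # concatenation: P('a') + P('b')
            return P(self.pattern + other.pattern)

        def __or__(self, other):             # alternation: P('a') | P('b')
            return P(f'(?:{self.pattern}|{other.pattern})')

        def many(self):                      # zero or more repetitions
            return P(f'(?:{self.pattern})*')

        def compile(self):
            return re.compile(self.pattern)

    word, digit = P(r'\w'), P(r'\d')
    ident = (word + word.many() + digit.many()).compile()
    print(bool(ident.fullmatch('foo123')))   # True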

One website I found extremely useful when initially learning regex was http://regexr.com. It's got a great pattern builder, and hovering over each component explains what it's actually doing.

For those who prefer a more in-depth introduction than a blog post: Jeffrey Friedl's "Mastering Regular Expressions"[1] covers the theory as well while still being very pleasant to read (I only know the first edition; apparently there are more recent ones).

[1] http://shop.oreilly.com/product/9781565922570.do


haven't read/processed the article fully, some nitpicks:

* `\b([01]?[0-9]|2[0-3]):([0-5]\d)\b` you are using both [0-9] and \d, is that intentional?

* `cat test.txt | grep -E "^[0-9]+$"` UUoC and double quotes subjected to shell interpretation, should be `grep -E '^[0-9]+$' test.txt` or `grep -xE '[0-9]+' test.txt`

* `(?i)` won't work with `grep -E` (or at least not for me on GNU grep). features of BRE/ERE and PCRE like regex are very different - see https://unix.stackexchange.com/questions/119905/why-does-my-...

* avoid parsing ls - https://unix.stackexchange.com/questions/128985/why-not-pars... , use glob/find

* `.<star>?` again, this regex feature is not available with BRE/ERE, your example just happens to work. you can check it with `echo 'abc foo 123 bar 123' | perl -pe 's/foo.<star>?123//'` and `echo 'abc foo 123 bar 123' | sed -E 's/foo.<star>?123//'` (using <star> to avoid formatting issues)
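
To see the difference the lazy quantifier makes in an engine that does support `*?` (Python's re, like PCRE):

    import re

    s = 'abc foo 123 bar 123'

    # lazy: the match stops at the first '123'
    print(re.sub(r'foo.*?123', '', s))   # -> 'abc  bar 123'

    # greedy: the match runs to the last '123'
    print(re.sub(r'foo.*123', '', s))    # -> 'abc '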


Thanks, that's all very useful feedback.

for the ls parsing example, you can do

    shopt -s nocaseglob
    ls ~/Downloads/*.{png,jpg,jpeg,gif,webp}

----

for command line tools (grep/sed/awk/sort/etc), you can refer to my ongoing project (https://github.com/learnbyexample/Command-line-text-processi...) as a resource :)


I just have to ask: doesn't the above comment kind of defeat the article's premise?

A very extensive article, and it shows how powerful RXes are.

It makes a point, in "8.3 - For Problems That Don't Require Regex" that RXes should not be overused when a simpler solution exists.

But there are also problems that are better solved without RXes, with a "parser" instead. The article mentions parsers, but fails to mention that self-made parsers also provide solutions in the same problem space as RXes. Some communities (like the Haskell community) prefer parsers, as they can be inspected, tested and typed. Parsers provide a more robust and hackable solution that is possibly faster than the equivalent RX.

Here's a short tutorial for a popular parser library in Haskell, to get an idea of what it's like:

http://akashagrawal.me/beginners-guide-to-megaparsec
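
For a small-scale sense of the hand-rolled alternative, here is a Python sketch of the article's 24-hour time check (the `\b([01]?[0-9]|2[0-3]):([0-5]\d)\b` pattern nitpicked earlier in the thread) written as plain code instead of a regex. Just a sketch, nothing like the megaparsec style:

    def parse_time(s):
        """Return (hour, minute) if s is a 24-hour 'H:MM'/'HH:MM' time, else None."""
        head, sep, tail = s.partition(':')
        if sep != ':' or not head.isdigit() or not tail.isdigit():
            return None
        if len(head) not in (1, 2) or len(tail) != 2:
            return None
        hour, minute = int(head), int(tail)
        if hour > 23 or minute > 59:
            return None
        return hour, minute

    assert parse_time('23:59') == (23, 59)
    assert parse_time('7:05') == (7, 5)
    assert parse_time('24:00') is None

Each condition is an ordinary check you can step through or unit-test, which is the inspectability argument in miniature.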


> From validating email addresses

Guys, no. Send an email. https://davidcel.is/posts/stop-validating-email-addresses-wi...


The article you link to addresses just the one use case where you are registering a new user. Email addresses in business applications are just as often NOT the address of your user. Sales contacts, managers, points of contact, customer support addresses, etc. -- none of these should ever be validated by sending an e-mail. So you still need to validate the hard way in plenty of scenarios.

...but unless you actually send an email it is still just as unvalidated as it was before. Regexes aren't a tool to determine that a string refers to a mailbox capable of receiving mail.

If you use regexes (or any other method that does not send emails), all you're saying is that you don't actually care whether or not the string points to a recipient (much less the correct recipient).

And don't get me wrong. It is absolutely okay to not care whether or not the string refers to a correct recipient. Most places that make me write my email have no business caring about it. But please also then make the field optional.


Regex validation cannot verify that an email address is functioning any more than it can verify that a phone number is functioning. But if an address or a number passes formatting checks, one can indeed consider that data "validated". That's what that term means in the business software development industry. We're verifying if the data entered COULD be correct, not double-checking that it IS correct.

You are correct to say that those emails may still bounce, and the phone calls may also not go through. We completely understand that. For this reason, in very specific situations (like registering a new user), we do take that extra step to make sure the communication channel actually works. But there are plenty of situations where that makes absolutely no sense, and/or adds very little value for the cost. Knowing the difference between these two very different use cases certainly does not indicate that these people "don't actually care" about the accuracy of their data.


The point is that the vast majority of regexes attempting to validate an email address will produce false negatives, i.e. reject valid email addresses. If a regex must exist, it should strip all whitespace characters first, then confirm there is a single '@' with one or more characters before it and two or more characters after it (the davidcel.is article mentions checking for a dot, but a@us could be a valid address that would fail that test); it should not balk at character sets other than ASCII.

If you want to do some additional non-regex validation, like confirm the hostname exists and has an MX record, have at it.
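
A minimal Python sketch of the rule described above (just the loose format check, not a spec-complete validator):

    import re

    def plausible_email(s):
        """Loose check: one '@', >=1 char before it, >=2 after, no whitespace."""
        s = ''.join(s.split())                 # strip all whitespace characters
        return re.fullmatch(r'[^@]+@[^@]{2,}', s) is not None

    assert plausible_email('a@us')
    assert plausible_email('пример@пример.рф')        # non-ASCII is fine
    assert not plausible_email('no-at-sign.example.com')
    assert not plausible_email('two@@ats')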


And what I'm saying is that either the system relies on functioning email addresses, in which case they need to be ensured to work anyway. ...or the system does not rely on working email addresses, in which case drop the pretense, make it an optional field and peoples' days will suck less.

Besides the practical issues mentioned, it's this no-brainer "why the heck not" collection of personal data I'm pushing back against. Either you need the stuff, and then you have to work for it, or you don't, and then you have no business with it.


Depends on the use case. For forms, a preliminary sanity check doesn't hurt, e.g. when the user enters their username and password, instead of email & password (accepting either would be best, but you can't always do that).

Mailcheck is great for this. People constantly make typos when entering emails. This utility suggests john.doe@gmail.com if they type in john.doe@gnail.com. But, it's their choice...it just suggests corrections.

My logs show that it helps a lot. Bounced emails from orders dropped by more than half.

https://github.com/mailcheck/mailcheck/blob/master/README.md


In fairness, the author does make note of that:

> Note - In a real-world application, validating an email address using a regular expression is not enough for many situations, such as when a user signs up. Once you have confirmed that the input text is an email address, you should always follow through with the standard practice of sending a confirmation/activation email.


That works if you have content that you know is an email address.

If you have free-form content, and need to identify email addresses in it, then you can’t just email every subset of that text.


> That works if you have content that you know is an email address.

That's what validation is. You're talking about finding email addresses, which I agree regex is relatively OK at, but there are still lots of edge cases in the spec that are irregular.


I created a little web app to generate strings which match a given regex: http://regexicon.com

It's useful to ensure you're not matching things that you don't mean to, which is a problem I often see when code reviewing regex.
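
If you want something similar offline, one option (assuming the third-party hypothesis library, whose from_regex strategy generates strings matching a given pattern) is roughly:

    from hypothesis import strategies as st

    # fullmatch=True asks for strings the whole pattern matches,
    # rather than strings merely containing a match
    strategy = st.from_regex(r'([01]?[0-9]|2[0-3]):([0-5][0-9])', fullmatch=True)
    print(strategy.example())   # e.g. '23:41'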


OP, your link to http://regex101.com/ is broken. It's my go-to web-app too.

Thanks! It's fixed now.

Also the link in this paragraph:

> The source code for the examples in this tutorial can be found at the Github repository here - https://github.com/triestpa/You-Should-Learn-Rege

has an `href` of `""`, which makes it point to the blog post itself.


Regular expressions are super powerful and a great tool to have at your disposal. The best resource on the topic is the book "Mastering Regular Expressions", published by O'Reilly. Understanding the inner workings of a regular expression engine will improve your regular expressions. If you use regular expressions a lot, mastering them will be worth it.

Yeah, learn regex so you know when not to use it, e.g. for parsing non-regular languages like XML...

The OpenBSD re_format man page is a quick read and super clear and helpful: https://man.openbsd.org/re_format

This sounds like an article I would have written in college. In my experience, if you're using regex often enough to remember how to write patterns that are sufficiently complex to actually be useful beyond short 'pre-processing' snippets, you're overusing regex. (Or alternatively, your memory is significantly better and/or less divided than mine.)

"Now you have two problems" . ;)

The Sam language and editor may be of interest:

http://sam.cat-v.org

http://doc.cat-v.org/plan_9/4th_edition/papers/sam


Kleene Algebra, generalized regular expressions, for the adventurous [1].

[1] https://en.wikipedia.org/wiki/Kleene_algebra


I'm not a programmer, but I use Notepad++ and mp3splt, because I have uses for them. This might actually help me use the search functions more effectively.

Some years ago I went through Regular Expressions Cookbook, by Steven Levithan and Jan Goyvaerts, and can't recommend it enough.

...but..but.. then you'll have two problems!! right guys?

Section 0:

`[0-9] - Matches any digit between 0 and -`

You mistyped the `9` as a `-`. :-)


Thanks, good catch!

Why the fuck do people cat and pipe stuff to grep? grep takes files from its args!

I cat files into programs that can take their input both from stdin and from a given file, although I know people call that "useless". I do it because it makes my pipelines more uniform, by keeping the direction of data flow from left to right. It also makes it easier to insert additional processing steps, or to change the input from a file to the output of a program. Maybe I'm a fool for writing Bash like Forth, but I prefer it that way.

you can still do `< test.txt grep ...`

see https://en.wikipedia.org/wiki/Cat_(Unix)#UUOC_.28Useless_Use...

>Beyond other benefits, the input redirection forms allow command to perform random access on the file, whereas the cat examples do not. This is because the redirection form opens the file as the stdin file descriptor which command can fully access, while the cat form simply provides the data as a stream of bytes.

commands like `sort` are optimized to handle large input files

and how would you do `grep -l 'foo' *.txt` if you use `cat`?

or `awk 'NR==FNR{a[$1]; next} $1 in a' file1 file2`

or `grep -Fxf file1 file2` and so on....



