Regular Expressions make me feel like a powerful wizard: that's not a good thing (shkspr.mobi)
71 points by mmastrac on Feb 9, 2023 | 75 comments



Well, maybe if you wrote it like this:

mandatory_leading_letters = r"^\w+"

optional_suffix = r"([-+.']\w+)*"

domain = r"\w+"

domain_suffix = r"([-.]\w+)*"

tld = r"\.\w+([-.]\w+)*$"

regex = f"{mandatory_leading_letters}{optional_suffix}@{domain}{domain_suffix}{tld}"

Then you could understand it? Seems to me the trouble isn't with regex but with the decision to write a regex without trying to make it understandable. It is also possible to minify your code onto one line in some languages, or otherwise obfuscate it; should we then condemn it?
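In Python, for instance, you could then compile and sanity-check the assembled pattern (a minimal sketch using the variables above):

    import re

    email_re = re.compile(regex)

    assert email_re.match("someone@example.com")
    assert not email_re.match("not an email")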

If the author had to fix a multi-line regex, I hope that in the process they understood it well enough to break it up into pieces, so that the next time it would be easier to debug.

Regexes are pretty neat. They are often not the solution; I would not want to use lookaround, for example. But they are really useful in a lot of cases, and I would hate to give them up just because they're capable of being obfuscated.


Yes!

I have been writing regexes like this for 20+ years. The "well then just don't write line noise...?" realization comes easily for those of us who spent a few years "maturing out of Perl". Because first you stop writing Perl line noise, and then you realize that regexes are just another part of that line noise (regexes being so common in Perl code).

Regular expressions DEFINITELY ARE the right tool if the problem that you have is that you need to parse a regular grammar (or near-regular grammar)!!!

Like, literally, they are 100% the right tool. Theoretically. Practically. Everything. The. Right. Tool.

And they are also the absolutely, terribly, completely wrong tool if you need to do anything other than parse a (very nearly) regular grammar.


Even this is unnecessary. I'd say 99% of people using PCRE-like regexes have never read the documentation and realized `(?x)` exists.

We can write the regex like this, including all whitespace:

    (?x)
    ^\w+            # mandatory leading letters
    ( [-+.'] \w+ )* # optional suffix
    @
    \w+             # domain
    ( [-.] \w+ )*   # domain suffix
    \. \w+ ( [-.] \w+ )*  # tld
    $
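In Python the same flag is exposed as re.VERBOSE (inline `(?x)` works too); a minimal sketch of wiring it up:

    import re

    EMAIL = re.compile(r"""
        ^\w+                  # mandatory leading letters
        ( [-+.'] \w+ )*       # optional suffix
        @
        \w+                   # domain
        ( [-.] \w+ )*         # domain suffix
        \. \w+ ( [-.] \w+ )*  # tld
        $
    """, re.VERBOSE)

    assert EMAIL.match("someone@example.com")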
Also, btw, I hope no one is really using this regex. It's wrong; for example it appears to be deliberately designed to fail on IDNs.


That's less readable than the original suggestion though.


Out of context on HN, it looks so. But in my experience, the original suggestion with the separate variables gets pretty hard to keep track of, and ends up making things more confusing. The `(?x)` or `/.../x` version makes it much easier to see things in their place and understand the overall pattern.


That's pretty neat, indeed I'd never heard of this, or maybe I had once but I'd forgotten about it if so. I might counter that in a very long regex you'd want to do the assembly in multiple steps, but I can definitely see the appeal of inline comments (perhaps in addition to something like that). I'll try to keep this in mind for the next time I'm dealing with regex. Thanks for sharing!


Outstanding method for commenting regexes. I had no idea about (?x). Thanks for this - I'll adopt it going forwards.


You know, that's kind of a fantastic way to do it. Really. How I usually did it was to copy candidate regexes from Stack Overflow and keep the one that passed my unit tests, though I really don't know a whole lot about regexes beyond the utmost basics.


I like the example using well-named local variables to put together the regular expression into something really readable. It reminds me of the chapters in Uncle Bob's Clean Code book on naming, functions, and local variables.

When reading the code, unless you're specifically debugging the regular expression, you don't really need to get into the details of validating what the regular expression is doing.


+1 this.

I only recently started using regexes exclusively to parse a grammar. When the rules got complex, so did my regexes, but once I divided my regexes into groups, each in its own variable, and formed the whole regex by concatenating those variables, the result was much more manageable for me.


I love using regexes whenever I need a way to parse HTML. Also whenever I parse nested parentheses, brackets, or braces.


These sorts of screeds against regexes always strike me as very strange: regexes can be written in the terse, difficult-to-read manner that the author demonstrates, but they can also be written with inlined comments, named capture groups, and plenty of whitespace, so that they're actually readable by normal humans. The former is useful when you are sitting at a terminal and need to sift through gigabytes of log files in an ad-hoc manner while debugging an issue, the latter when you want to commit something to a codebase. Confusing these two use-cases, and then writing a few hundred words of inflammatory flamebait about it, seems unfair to what is otherwise an extremely powerful and versatile tool on the one hand, and a relatively unobjectionable[1] way to write software on the other.

[1] I think that sticking even moderately complicated regexes in a modern codebase is probably not a great plan most of the time (lots of modern languages have easier-to-use options for accomplishing complex parsing tasks!), but given that regex libraries typically ship as part of the standard library and are extraordinarily well documented as a language, they're not a bad tool to reach for in a lot of cases.


I've always struggled with writing regexes. It's somehow counterintuitive for me. Luckily I have friends who just adore writing them and will help me when I get stuck. Thought I had a killer idea of creating a regex generator using AI. Well, someone beat me to it.

https://www.autoregex.xyz/

I have heard of another one as well. I suspect these will be quite popular projects for a while. I wouldn't care if I never had to write another regex again.


And usually where you have a complex regex, you can extract it into a stateless util method and then unit test the hell out of it. Which is pretty useful to aid in understanding.
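For example, a minimal sketch in Python (the names here are hypothetical):

    import re

    _EMAIL = re.compile(r"^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$")

    def is_valid_email(s: str) -> bool:
        """Stateless helper: True if s looks like a plausible email address."""
        return _EMAIL.match(s) is not None

    # The tests document intent and catch regressions when the regex changes.
    def test_is_valid_email():
        assert is_valid_email("someone@example.com")
        assert not is_valid_email("someone@")
        assert not is_valid_email("@example.com")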


My take is that most regexes are easier to read than the alternative code. You do have to step through them and think about what they match, but the same is true for any other code.

> the very existence of RegEx101.com ought to bring shame on our industry.

> ...

> You have a desire to build something hard to debug.

Regex101 is there to make it easy to debug regexes. It would be MUCH harder to debug a complex parsing function than a regex with Regex101.

> You don't trust compilers.

I don't even understand the idea here. You need a compiler for the regex. Why would someone who doesn't trust compilers use regexes?


> It would be MUCH harder to debug a complex parsing function than a regex with Regex101.

I encountered, this week, a 42-line function that determines whether a string is a valid representation of a double. The function is wrong and has been wrong for a decade or more. The code is obtuse, using unclear state variables (multiple boolean flags) to accept or reject the string at various points. It could have been a regex, and we could have seen (almost) at a glance what was intended. There are several other functions in the same file doing similar things, so this ended up being something like 300+ lines of both wrong and difficult-to-understand code that could have been maybe 30 total lines of relatively easy-to-read-and-debug regexes.


Yeah, I'd definitely replace that whole function - with a try/catch-wrapped (or NaN-checked, etc.) attempt to parse the string as a double. Not a regex, which then still would have to be tested to make sure it's exactly as strict as whatever parser it's going to be passed to, anyway.
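The codebase in question sounds like C++, but the idea translates anywhere; a minimal Python sketch:

    import math

    def is_double(s: str) -> bool:
        """Validate by delegating to the same parser that will consume the value."""
        try:
            value = float(s)
        except ValueError:
            return False
        return math.isfinite(value)  # optionally reject NaN/inf, per the NaN-check idea

    assert not is_double("1-2")  # e.g. reject malformed numerics
    assert is_double("3.14")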


Fair, but I'm not even sure where this value ends up yet, so I don't want to parse it (versus just validating it); just getting it to reject "1-2" as a double would be a start (it's currently accepted, and note the function is not meant to accept expressions, just values). Stage one of fixing this thing is tackling obviously wrong code like this. There's also their creative shared-pointer implementation, which is also wrong: the memory will be freed while other pointers still have access to it, and the reference counter lives not in the shared pointer but in the object it contains, so checking the reference counter at that point will itself result in use-after-free errors. And their fantastic use of iterators will crash the program if that code is ever called (they didn't understand iterator invalidation). And yesterday I found that they read a file (in its entirety!) one character at a time, tacking the results onto the end of a string. There is no parsing logic at that point; it's literally just building a giant string. All of that is in what should be straightforward logic any trained novice could handle. The real critical code, which solved a somewhat novel problem (not novel novel, but not an everyday problem), gets even more creative.


By the sound of it, this may well indeed be the rare case in which jwz's dictum fails to hold, and using a regex will yield one problem fewer - or, failing that, maybe kick a hole in the side and go looking for a river or two. In any case, good luck...


I agree, I find regular expressions to be the clearest way to describe families of strings that share a specified structure and no longer find them hard to read. This sort of argument has always struck me a bit like the way people will say “Greek is hard to read because you have to learn the alphabet”: learning the alphabet for Greek isn’t the hard part of learning Greek and, similarly, learning the meaning of the characters used in regexes isn’t the hard part of using regexes. The hard part is learning to think in the language that uses those characters.


Regular expressions are virtually always easier to read because they're so much terser. I unroll a fair number of regular expressions because they are typically an order of magnitude slower than the unrolled code, and the unrolling is almost never an upgrade in readability.

Compare for example something trivial like

    [a-f0-9]{32}
with its unrolled form

    public boolean hashTest(String path) {
        int runLength = 0;
        int minLength = 32;  // length of a hex-encoded 128-bit hash

        // Too short to contain a hash plus surrounding path characters.
        if (path.length() <= minLength + 2)
            return false;

        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);

            if ((c >= '0' && c <= '9') || (c >= 'a' && c <= 'f')) {
                runLength++;               // still inside a run of hex digits
            }
            else if (runLength >= minLength) {
                return true;               // a long-enough run just ended
            }
            else {
                runLength = 0;             // run was too short; start over
            }
        }
        return runLength >= minLength;     // the run may extend to the end
    }


> It would be MUCH harder to debug a complex parsing function than a regex with Regex101.

Use a parser combinator library and your parser is much easier to debug, because everything's compositional and you can unit-test it.
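With a combinator library like Python's pyparsing, for instance, each piece of the email pattern becomes a named value you can test in isolation (a rough sketch, simplified and nowhere near RFC-grade):

    from pyparsing import Combine, OneOrMore, Word, ZeroOrMore, alphanums

    word   = Word(alphanums + "_")                              # roughly \w+
    local  = Combine(word + ZeroOrMore(Word("-+.'", exact=1) + word))
    label  = Combine(word + ZeroOrMore("-" + word))             # hostname label
    domain = Combine(label + OneOrMore("." + label))            # requires a dot
    email  = Combine(local + "@" + domain)

    # Each sub-parser is unit-testable on its own:
    assert local.parseString("first.last", parseAll=True)
    assert email.parseString("someone@example.com", parseAll=True)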


Yeah, when complaining about multi-line regexes, show us the alternative imperative code. Is it really easier to understand and maintain?


All this article tells me is that the author doesn't really understand regular expressions. The @ sign and escaped . make it pretty clear we're probably trying to parse an email.

Why do we allow all this reductive "why do we make this so hard" drivel? Why is it that people's attitude isn't, "wow, something I don't know, I should learn what this is..." instead of "STUPID HULK SMASH!!! HULK BRAIN HURT!!!! OWWWW!!!!"

Like, I'm sorry the author sucks. He doesn't know regular expressions very well, as parts of his post are just plain wrong, and his regular expression itself looks unnecessarily convoluted.

Regular expressions exist because string parsing code fucking SUCKS. I started programming in Visual Basic, where there were no regexes and string parsing was a verbose hell of substring processing, array indices, and equivalency checks. Terse regular expressions can replace dozens of lines of word vomit, and the basic library of symbols isn't that hard to memorize. It is WAY easier to keep a regular expression in your head than the alternative.


Completely agreed. Just a very basic understanding of regex has saved me a very significant amount of time throughout my entire ~8-year career. I'd say it's one of the fundamental skill sets that I've leaned on consistently. This post just gives me "it looks scary, therefore it's awful" vibes.


Another thing I don't get: there's really not much to learn. Basic regex knowledge covers 90% of real-world uses and requires reading a page or two.


> There's no space for comments.

For every non-trivial regex I write, I comment it heavily. I just put it together using string concatenation, like:

  var r = "[...]" +     // comment goes here
          "..." +       // another comment
          "(" +
            "..." +     // yet another comment
            "|..." +    // and another
and so on. Works like a charm, and I make sure to indent on the parentheses as well.

Also in the past when a regex has various parts repeated, I'll define regex "parts" as variables, and then use those.

You can make regexes as clear as you like.


Perl makes it easy to do as well.

    # Delete (most) C comments.
    $program =~ s {
      /\*   # Match the opening delimiter.
      .*?   # Match a minimal number of characters.
      \*/   # Match the closing delimiter.
    } []gsx;
And I always do that to remind myself what I am doing it for.


Yes, Perl's regex options are great for commenting, and overall it's one of the best things about the language.

I'll also paste one or more commented example input lines above my regexes, so it is clear what kind of input you are expected to be processing; it always ends up being a time saver when debugging or updating code.


I learned regex on perl, and the community (at the time) was always so good about commenting their regex strings in a similar fashion.


It is beyond me how people continually compare things that are (supposedly) hard to read because of HIGH semantic density (regex/perl/APL/...) with things that are hard to read because of LOW density (brainfuck). It is a pretty terrible comparison, but sadly people seem to do it a lot. Is it not obvious just from a quick glance that they're completely different?!

Anyway, the regex in question is extremely easy to read (the clue is in the name regular) (that's a word, followed by any number of groups that are one of these characters followed by a word... etc - you get the point). I think that regex is probably one of the easier ways of expressing the idea.

Regexes aren't 'skimmable', but they are (in a lot of cases, like this one) readable (because they are, yknow, pretty regular), so reading one slowly should be possible for anyone who knows the basic rules, of which there aren't many.


Perl qr//x equivalents aren’t available in most languages sadly.


The author is more than welcome to write the equivalent code themselves, which is probably going to be either less readable (if they derive the automaton by hand) or much slower (if they do redundant tests or pre-processing that the automaton could have saved them from). Regular expressions are also more amenable to static analysis, in both linting and optimisation, than equivalent code, being more declarative in nature; a regex and an imperative matcher have a relationship like SQL and the query plan generated by a database.


Actually, I feel that the parsec syntax in Haskell, for example, is much more readable - using many, noneOf etc.

A chapter of the book "Real World Haskell" builds a simple CSV parser:

https://book.realworldhaskell.org/read/using-parsec.html

It may be adaptable for regex and for static analysis as well.


Parser combinators are one of those rare REALLY great things that almost entirely replace something else (regex). Sadly, way too verbose in many languages.


Regular expressions can be written verbosely and with comments; you don't need to write them like line noise, any more than you need to do so with normal code. Regular expressions also have the significant advantages of being both pure and guaranteed to terminate. There's even libraries (like re2) that guarantee termination in linear time.
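In Python, for example, the RE2 bindings are close to a drop-in replacement for the stdlib re module (a sketch, assuming the google-re2 package, which exposes an `re`-compatible `re2` module):

    import re2  # from the google-re2 package; mirrors the stdlib `re` API

    pattern = re2.compile(r"^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$")
    assert pattern.match("someone@example.com")  # matching runs in linear time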

It's insane to hand-write text-searching code when a regular expression could reasonably be used instead. This article is objectively bad advice and should not be followed.


Emacs is heavily regex-centric (though tree-sitter support may be slowly changing this), and it suffers badly from escaping problems when writing them, so it's not surprising that it has a built-in library to generate regexes in a much more readable fashion:

https://www.gnu.org/software/emacs/manual/html_node/elisp/Rx...


Rx is a perfect thing to bring up in this discussion, and I encourage everyone to check it out. To me, Rx showcases two things:

1. What a high-readability but consistent, structured, sugar-free syntax for regular expressions looks like - as any code you hand-roll to replace your regexes will be strictly inferior to feeding an Rx expression to a regex engine.

2. Why you'll still want to use plain regular expressions anyway. Rx expressions grow large very quickly, so for any non-trivial problem, you'll quickly reach the point past which they're less readable than the raw regexp, by virtue of sheer size. And, again, whatever your language, your replacement for a regular expression is unlikely to beat Rx.

Those two points together form an argument that regular expressions are often the right tool for the job, and while seemingly requiring more concentration up front, they'll be easier to work with due to lower demand on your working memory.


There is a third point that trumps these two: having a structured format allows you to programmatically compose regular expressions.

In most string encodings it is hard or unergonomic to safely embed regexes or string literals in other regexes.

In this sense a regex is quite similar to SQL, just used for simpler operations generally.

It is to be noted that while regexes are used to parse (mostly) regular languages, the language of regex expressions is fully context-free.

Personally I dislike manually embedding context-free languages inside a regular encoding (strings) already embedded inside another context free language that could have just added support for structured regexes in the first place.


> Rx expressions grow large very quickly, so for any non-trivial problem, you'll quickly reach the point past which they're less readable than the raw regexp, by virtue of sheer size.

That's not even the case. You can define and compose rx expressions. This allows you to build larger more complex rx expressions from smaller simpler ones in much the same way as you manage the complexity of a large program by building it up from subroutines.


You can compose raw regexes too. There's a bit more things to keep track of than with composing clean s-expressions, but you can do it, and if you try, you'll hit the same problem as with composing Rx.

> build larger more complex rx expressions from smaller simpler ones in much the same way as you manage the complexity of a large program by building it up from subroutines.

Yes, and with it comes a problem: factoring out legos and composing the solution out of them reduces complexity, but it sacrifices locality. There's no free lunch: some things get easier because you get to ignore irrelevant detail, other things get harder, because the relevant details are all over the place. Humans have a limited working memory, so too much composition, or factoring along the wrong dimension to the problem you're solving, destroys readability - the solution no longer fits in your head, and you keep chasing pointers, constantly evicting one piece of the puzzle from your head to fit another one.

In this context, terseness and locality become desirable qualities. They make you expend more cognitive effort up front, but save you the working memory overhead of extra abstractions that come with composable pieces, and eliminate the cost of pointer chasing, as the whole thing is literally in front of your eyes all the time.

This is, to my understanding, the actual reason math-heavy fields (including mathematics itself) stick to dense equations built of single-character names chosen from several alphabets - it ultimately saves time. A single line in a math paper may fully express a complex thought which, were you to rewrite it in "clean code" style, would take 10 pages and involve several extra layers of abstractions.

I'm not advocating we should rewrite everything in APL (though I'm also not convinced it wouldn't be better on the net) - abstraction and composition are the fundamental tools that let us deal with complexity. But they have a cost, and sometimes that cost is too high. Regular expressions are, in my experience, usually a case of that.


rx is a great example here. Add in the fact that you can manipulate rx forms with macros and you're dealing with some serious power.

With that said, I think the most ergonomic string searching tool I've used is the PEG implementation in Janet:

https://janet-lang.org/docs/peg.html


It takes some time to read a regex, but I find them quite useful, because the alternative would be a bunch of if statements, splits, loops, and whatever, just to do a match on a string. A regex is very neat and concise!


> a bunch of if statements, splits, looping and whatever

Also a few off-by-ones and edge-condition bugs lurking in “whatever”.


I agree with this guy to the extent that it depends on the use case. If I'm writing something in a general-purpose language, I 100% avoid regex. Why? My brain has to switch contexts from something I know well to something arguably more efficient. To me that breaks the flow, and I don't need any help with that.

Where it makes sense to me is, for instance, IIS rewrite rules.

Just my 2 cents.


I think that regex is beautifully simple and communicates simple computations elegantly.

The problem is that they are often not used for simple computations, and therefore don't appear elegant.

They are, quite literally, like a flowchart. If it gets to be too big, then it gets difficult to understand without adding context to the parts (as some sibling comments have mentioned, by labeling the parts or by adding comments outside of the string definition).

Understanding Regex (and, by extension, DFAs) is very helpful, depending on your job. I used to teach Theory of Computing, and students would say "I'll never use this in the real world!", and yet only yesterday I was putting these very skills to use in my industry job in validating contract stipulations of a system.

I don't think it makes me feel like a powerful wizard... rather, I think I'm getting to spend time admiring a beautiful art exhibit!


Regexper is the most useful regex explainer I've found. It gives you a railroad tracks diagram for the RE. Here is the Regexper link for the RE in the article [1].

[1] https://regexper.com/#%5E%5Cw%2B%28%5B-%2B.'%5D%5Cw%2B%29*%4...


A couple of years ago I worked on a project that involved parsing the very idiosyncratic output of a lot of different command-line commands. Not standard or semi-standardized commands like cat, ls, etc. - a vendor-written command line interface running on a microcontroller, for testing a board. Because apparently a number of different engineers worked on it, all the commands had "ad hoc" output formats, with very little commonality in the way the output was formatted, so there was not a lot of opportunity to write common handlers. I wound up using a lot of Python regular expressions that captured specific fields in the output, handling IIRC about forty different commands.

The best I could do was to break the regexes down across multiple lines and comment each line, and break the handlers down into a hierarchy of smaller functions. Each function had a number of sample strings right in the code, and each file of code for handling the variations on output of a single command also had a unit test right in the same file which fed all the samples to the regexes. So the handling of the 40 different commands could be unit-tested very quickly with a single pytest command line. This was a godsend to help keep my change things-test things-change things again "loop" very quick.
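In the same spirit, an illustrative Python sketch (with hypothetical command output, not the project's actual code):

    import re

    # Sample lines from the vendor CLI, kept next to the regex they exercise:
    SAMPLES = ["temp sensor 3: 41.5 C (ok)", "temp sensor 12: 39.0 C (ok)"]

    TEMP_RE = re.compile(
        r"temp sensor (?P<index>\d+):\s+"  # sensor number
        r"(?P<celsius>\d+\.\d+) C\s+"      # reading in degrees C
        r"\((?P<status>\w+)\)"             # status flag
    )

    def test_temp_re():
        for line in SAMPLES:
            m = TEMP_RE.match(line)
            assert m, line
            assert m.group("status") == "ok"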

The alternative might have been to actually write parsers, but I don't think that would have been more readable and it certainly wouldn't have been as quick.

I don't think language designers should get rid of regexes and I don't think programmers should stop using them. I do think there is definitely room for languages that implement regexes with more syntax. People might argue that this is just needless syntactic sugar, but so is spelling out keywords like "else." We ought to be looking at prior art. The history of programming languages is a very deep trashpile containing a lot of gold that can be mined with a little effort. SNOBOL's syntax for pattern matching was maybe too wordy by modern standards but I'm sure a lot of younger programmers would be pretty surprised to find how early a lot of things like regexes were invented, and how inspirational it can be to look at old languages.


Regex is a tool. You mostly just need to understand how it works and when it is useful to get value out of it. By virtue of having used it for a few years, I can generally read it well. I can do the regex crosswords. But I've used it effectively since the day someone told me what it was and I am not a genius. I just kept a guide near. As long as I knew what result I wanted I could use the guide to get it done. That's why I don't ever agree with these articles on regex being some alien language impossible to learn. It looks weird, we get it, it also kicks ass and takes only a minimal effort to understand an expression if you know that's what you're looking at.


Regexes can be likened to mathematical set theory. They can help us group vague, unknown data which needs to be processed in some way. And just like someone can get carried away with the over-normalisation of a database, the same can happen with complex regexes, where the time and complexity could be reduced if a single complex regex became part of a decision-tree-like process using other, simpler regexes.
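For instance, one big email regex can become a small decision tree of simpler ones (an illustrative Python sketch):

    import re

    LOCAL = re.compile(r"^\w+([-+.']\w+)*$")
    HOST  = re.compile(r"^\w+([-.]\w+)*\.\w+([-.]\w+)*$")

    def looks_like_email(s: str) -> bool:
        # Decide structure first, then apply a simple regex to each half.
        parts = s.split("@")
        if len(parts) != 2:
            return False
        local, host = parts
        return bool(LOCAL.match(local)) and bool(HOST.match(host))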

Mathematical naïve set theory is fundamentally flawed, though; otherwise Russell's paradox wouldn't exist.


Huh? They just make me feel competent at my career.


Or start a career!


I don't really agree with the article. I often break up my complex regexes into different variables that get re-assembled before passing it off to the parser.

The thing is, with regex, maybe worse is better. It's the best we've got (for now), and I can't imagine parsing complex text without it. Manually parsing it (tokenizing, etc.) without regex is much harder.


It’s funny but I have made the opposite transition recently.

The reasons: writing parsing code by hand tends to be verbose, error-prone (off-by-one errors and the like), ad hoc, and full of temporary strings; editors syntax-highlight regexes (and there are sites like regex101); and lastly, regexes are matched in one go (match/no match), which is good for readability.

I do avoid making regexes long if I can.


I honestly feel like EBNF or similar is easier to comprehend. I get that for short things it's easier and more legible to use regex, but once you get to complex stuff, the verbosity of spelling out a grammar is kind of a boon. And I say that as someone who has at various times been really comfortable writing some terrifying but accurate and useful regexes.
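For comparison, the article's email pattern spelled out as informal EBNF (a sketch; `word` approximates the regex \w+):

    email  = local , "@" , domain ;
    local  = word , { ( "-" | "+" | "." | "'" ) , word } ;
    domain = word , { ( "-" | "." ) , word } ,
             "." , word , { ( "-" | "." ) , word } ;
    word   = wordchar , { wordchar } ;
    (* wordchar: any letter, digit, or "_" *)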


EBNF is awkward if you're describing a typical regular grammar. It's very natural for typical context-free grammars.

The dirty secret of CS school is that your theory course is mostly about parsing until you get to Turing machines (I joke, but only kind of).


Don't tell him about APL then.


Regular Expressions make me feel like I need to read and practice more theory of computation.


I use ChatGPT to explain or create regexes. It’s ideal because I don’t need it to be 100% reliable as I can and should verify it. But it’s much faster than writing them tediously. And I don’t ever use them often enough to become an expert.


One benefit of regex that isn't mentioned in the article is that you can expect it to be available, at least the basics, in pretty much any programming, scripting, or even querying environment.


I avoid regexes like the plague. I like PEGs as a replacement.


Unfortunately, most standard libraries don't have a PEG module, and PEG libraries are complicated.


What would PEG for an email address (like in the article) look like?



I’m aware of PEGs (lpeg specifically), but couldn’t find a real email parsing example myself, hence the question. Most examples on the internet are as cryptic and noisy as regex imo when it comes to pattern matching, so I thought maybe there are other implementations.

I find PEGs useful for actual CFG parsing, but tbh can’t see why I would avoid regex in favor of it in this case.
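For what it's worth, here is a rough sketch of such a PEG using Python's parsimonious library (illustrative only, nowhere near an RFC-grade email grammar):

    from parsimonious.grammar import Grammar

    email = Grammar(r"""
        email  = local "@" domain
        local  = word (lsep word)*
        domain = label ("." label)+
        label  = word ("-" word)*
        word   = ~"[A-Za-z0-9_]+"
        lsep   = ~"[-+.']"
    """)

    email.parse("someone@example.com")  # raises ParseError if it doesn't match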


he lost my attention with this:

> I genuinely - and possibly misguidedly - believe that even something like

    ^\w+([-+.']\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*$
> might just as well be written in BrainFuck.

all those \w & \W's? c'mon, that's the easiest RE parsing task you could possibly confront!

it is really difficult to look at RE expressions, but it's also really hard to think of a better way


Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

Jamie Zawinski (1997)


My first boss taught me this proverb 25 years ago:

Dev who solves problem with regex, now has two problems.


Regex is not only powerful, it's also expensive. If you can solve a problem without regex, it's usually the better way to do it.


What do you mean "expensive"? When I've seen people roll their own state machines to replace a moderately-complex regex, the result is frequently hundreds of bug-ridden lines and rarely matches the performance of a decent regex engine. This laborious learning experience often results in reverting to the regex. It's the ignorance that's expensive.


Depends on your regex engine, and your non-regex solution. My engine (shameless self-plug https://github.com/telekons/one-more-re-nightmare) rivals hand-written automata, having to load each character more-or-less* only once, and throws in vectorisation for simple search loops too. I would not want to write or maintain the generated code.


Bonus points for the King Crimson reference!


It’s very efficient in processing strings since it’s linear time with respect to input string length if implemented properly.


ChatGPT....


It's not that the author is stupid for not 'getting it'; it's simply that regex is a write-only language.



