
Getting started with regular expressions: An example - tcarriga
https://www.redhat.com/sysadmin/getting-started-regular-expressions-example
======
macando
For some reason regular expressions have the lowest expiry date in my mind's
cache. I had to relearn them at least 10 times. Watching that famous Udemy
programming course where regex is explained in great detail with examples and
state machine diagrams didn't help.

~~~
kbenson
Regular expressions are a powerful tool, but unless you use them often after
learning them (and are in a language where it makes sense to do so), it can be
hard to make them stick.

I've had amazing success in using a regex in situations where you wouldn't
think it would work as well as other solutions. For example, I've gotten more
than an order of magnitude speedup in parsing a well known simple (but fairly
large) XML data set using regular expressions instead of the fastest XML
parsing libraries I could find at the time. Sometimes the less efficient tool
that does only what you need is much faster than the highly optimized tool
that handles all the special cases that don't matter for the particular job.

~~~
macando
You use whatever works for your case. When I hear parsing with regex I always
think of this legendary Stack Overflow answer :)
[https://stackoverflow.com/a/1732454](https://stackoverflow.com/a/1732454)

~~~
kbenson
Yeah, that's a classic. But it's also mostly about parsing _arbitrary_ HTML,
and that's where "well known" comes in. The data in question looked somewhat
like:

    
    
      <doc>
        <bool name="foo">1</bool>
        <int name="bar">12345</int>
        <str name="baz">test string</str>
        <float name="quux">1.234</float>
      </doc><doc>
        <bool name="foo">0</bool>
        ...
      </doc>
    

but with a lot more fields per doc. To pull out each <doc> as a string (using
a regex) in a while loop, and then parse the doc into a hash of key-value
pairs (another regex) that are stored in an array, is less than twenty lines
of pretty standard Perl, including exception handling and error reporting:

    
    
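        use Try::Tiny;    # assumed source of try/catch; the error arrives in $_

        # $item_xml is assumed to already hold the whole XML document
        my @items;
        my $count = 0;
        while ($item_xml =~ m{<doc>(.*?)</doc>}gsmi) {
            my $item = $1;    # copy the capture before anything can clobber it
            try {
                # parse each field of this <doc> into a hash of name => value
                my $i = {};
                $i->{$1} = $2 while $item =~ m{<(?:arr|date|str|bool|int|float) name="([^"]+)">([^<]+)</[^>]+>}gsmi;
                push(@items, $i);
                $count++;
            }
            catch {
                warn "Error :: $_";
            };
        }
        print "$count items found\n";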
    

And if you think the regex to pull out the field values is hard to read, it
could always be written in the extended format like so (this is overly
verbose, but you should get the idea):

    
    
        my $field_parse_re = qr{
          # Parse opening tag
          < # Start of opening tag
          # Any of the allowed tag types
          (?: arr
            | date
            | str
            | bool
            | int
            | float
          )
          \s # space between tag type and name attribute
          name="([^"]+)" # Save tag name as capture group 1, or first returned item
          > # End of opening tag

          # Parse tag contents
          ([^<]+) # One or more characters that are not <, as capture group 2, or second item returned

          # Parse enough of the closing tag to make sure we got all the contents
          </
        }smix;

        # Now use it (the qr// object carries its own flags, so only /g is needed here)
        $i->{$1} = $2 while $item =~ m{$field_parse_re}g;
    

In any case, that's a trivial amount of work to beat the fastest XML parser I
could find (and I surveyed a few libraries) by something like 14x, IIRC.
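
For anyone who wants to try the approach, here is a self-contained sketch of the same two-regex idea. It is not the original script: the sample data is made up to match the shape shown above, the names `$xml`, `$field_re`, and `@docs` are hypothetical, and error handling is left out to keep it short.

```perl
use strict;
use warnings;

# Made-up sample shaped like the data described above.
my $xml = <<'XML';
<doc>
  <bool name="foo">1</bool>
  <int name="bar">12345</int>
  <str name="baz">test string</str>
</doc><doc>
  <bool name="foo">0</bool>
  <float name="quux">1.234</float>
</doc>
XML

# One field: tag type, name attribute (capture 1), text content (capture 2).
my $field_re = qr{<(?:arr|date|str|bool|int|float)\s+name="([^"]+)">([^<]+)</}i;

my @docs;
# Outer regex: pull out each <doc>...</doc> body, non-greedily.
while ($xml =~ m{<doc>(.*?)</doc>}gsi) {
    my $body = $1;
    my %fields;
    # Inner regex: collect name => value pairs for this doc.
    $fields{$1} = $2 while $body =~ /$field_re/g;
    push @docs, \%fields;
}

printf "%d docs, baz = %s\n", scalar @docs, $docs[0]{baz};
# prints "2 docs, baz = test string"
```

This only works because the input is known to be flat and regular; nested tags, CDATA, or attributes in a different order would need a real parser.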

~~~
macando
Being an engineer means finding/building the best tool for the task at hand.
Sometimes it feels great to write a short and effective snippet of code
instead of examining how some bizarre API works.

------
dmonitor
My life can be divided into pre-regular expressions and post-regular
expressions. I get excited every time I find an opportunity to use them, and it
happens a lot.

~~~
Zhyl
Regular expressions are one thing I would want to teach non-technical people.
So many day-to-day tasks, especially in normal office work, could be made
trivial.

------
mikece
I wonder: knowing everything we collectively do now, if we had the chance
to re-invent regular expressions, would they look any different? Given the use
case I think the complex syntax is inescapable, but I can't help thinking
this could be implemented a bit more simply without losing any power.

~~~
dan-robertson
Perl 6 (i.e. Raku) did do this. Sort of. They made regexes look a bit more
like BNF grammars (or at least the way those grammars are typically written).

Whitespace is ignored (so a regex can be written over multiple lines, with
comments), which makes them more readable, and regexes stored in variables can
be referenced directly within another regex. I think they also made some of
the syntax nicer (e.g. non-capturing groups or lookaheads), and added notation
for common things (e.g. like X* except everything is separated by commas) and
long names for things like character classes.

~~~
tyingq
The /x modifier in Perl 5 allows for comments and ignores whitespace. Things
like [[:upper:]] are also in Perl 5.
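
A minimal Perl 5 sketch of those features, plus interpolating a stored regex into a larger one as mentioned above. The pattern names and sample strings are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# /x lets the pattern span lines and carry comments.
my $capitalized_re = qr{
    \b              # word boundary
    [[:upper:]]     # one uppercase letter, POSIX class
    [[:lower:]]+    # one or more lowercase letters
    \b
}x;

# A regex stored in a variable can be interpolated into another one,
# here to describe a comma-separated list of capitalized words.
my $list_re = qr{ \A $capitalized_re (?: , \s* $capitalized_re )* \z }x;

my @words = ("Perl and Raku share Larry" =~ /($capitalized_re)/g);
print "@words\n";                                       # prints "Perl Raku Larry"
print "list ok\n" if "Alice, Bob,Carol" =~ $list_re;    # prints "list ok"
```

Perl 5 has no built-in "separated by" quantifier, so the list is spelled out as one item followed by zero or more comma-prefixed items.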

------
austincheney
Things I love about regex:

* The global or _g_ switch. The global switch applies a pattern to all matches in a string, which can be used to return all pattern matches or apply all pattern changes.

* In JavaScript, space-like characters are matched by the _\s_ character class escape. There are many characters classified as white space, so it is very nice to have a grouping for them when performing a search and being able to ignore them all if necessary. _\s+_ will match a run of consecutive characters that may be any combination of space, new line, carriage return, form feed, and many other characters.

* Partial matches are helpful. Sometimes you know the pattern you want to search for and you need to replace it with a different pattern without harming specific data in the match. In this scenario I call the replace method on a string and search by regular expression with the _g_ global switch. For the replacement I supply a function. Each regular expression match is passed to that function as a string argument. In the function I can manipulate that match into something else, even using regular expressions, and return the result, which is inserted back into the target string in place of the match.
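
The three points above describe JavaScript, but the same ideas can be sketched in Perl, the language used elsewhere in this thread. The sample string and the `double_number` helper are made up for illustration; Perl's `/e` modifier plays the role of the replacement function.

```perl
use strict;
use warnings;

my $text = "order  12 \t shipped,\norder 7 pending";

# /g in list context returns every match, not just the first.
my @numbers = $text =~ /(\d+)/g;

# \s+ matches any run of whitespace (spaces, tabs, newlines);
# here each run is collapsed to a single space.
(my $normalized = $text) =~ s/\s+/ /g;

# /e evaluates the replacement as code, so each match can be passed
# through a function. double_number() is a hypothetical helper.
sub double_number { my ($n) = @_; return $n * 2 }
(my $doubled = $normalized) =~ s/(\d+)/double_number($1)/ge;

print "@numbers\n";    # prints "12 7"
print "$doubled\n";    # prints "order 24 shipped, order 14 pending"
```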

------
spartas
[https://regexcrossword.com](https://regexcrossword.com)

------
mekane8
I love the real-world example, but I couldn't help but wonder if this article
was more about Sed and Awk than actual regex.

