
Alternatives to Regular Expressions - vezzy-fnord
http://c2.com/cgi/wiki?AlternativesToRegularExpressions
======
Retra
"I am not much of a fan of RegularExpressions. It is too hard to remember what
the symbols stand for, for one. Asterisk for 0 or more repetition, plus for 1
or more repetitions, question mark for 0 or 1 occurrences, brackets
surrounding a set of characters -- who's got time to memorize this kind of
complexity? "

Somebody please tell me this is sarcasm...

~~~
jacobolus
Regular expressions are a very obviously inefficient and confusing syntax,
full of incidental complexity, and to make matters worse, they’re often a bit
less powerful than we want them to be.

The best summary of the problems I’ve seen is from Larry Wall, highly
recommended if you haven’t read it before.

First half of this page:
[http://perl6.org/archive/doc/design/apo/A05.html](http://perl6.org/archive/doc/design/apo/A05.html)

Unfortunately, I think at this point regular expressions are too firmly
entrenched in too many places to properly replace.

~~~
raiph
From the preamble of the page you linked:

> In fact, regular expression culture is a mess, and I share some of the blame
> for making it that way. Since my mother always told me to clean up my own
> messes, I suppose I'll have to do just that.

So Larry aimed to do better in Perl 6. Do you think he is as good at solving
problems as he is at identifying them -- or more specifically, what do you
think of the "Rules" system[1] he came up with for Perl 6?

[1]
[https://en.wikipedia.org/wiki/Perl_6_rules](https://en.wikipedia.org/wiki/Perl_6_rules)

------
draegtun
Unfortunately no examples of Rebol's _parse_ were given. Here are some links
to fill the gap:

[http://www.rebol.net/wiki/Common_Parse_Patterns](http://www.rebol.net/wiki/Common_Parse_Patterns)

[https://en.wikibooks.org/wiki/REBOL_Programming/Language_Fea...](https://en.wikibooks.org/wiki/REBOL_Programming/Language_Features/Parse/Parsing_examples)

[http://blog.hostilefork.com/why-rebol-red-parse-cool/](http://blog.hostilefork.com/why-rebol-red-parse-cool/)

[http://rebol-land.blogspot.co.uk/2013/03/rebols-answer-to-re...](http://rebol-land.blogspot.co.uk/2013/03/rebols-answer-to-regex-parse-and-rebol.html)

[http://www.rebol.com/r3/docs/concepts/parsing-summary.html](http://www.rebol.com/r3/docs/concepts/parsing-summary.html)

[http://www.rebol.net/wiki/Parse_Project](http://www.rebol.net/wiki/Parse_Project)

------
wyager
Perl-compatible regular expressions are _not_ semantically equivalent to real
regular expressions (as the article seems to claim). In fact, they correspond
to completely different grammars in the Chomsky hierarchy. PCREs are Turing
complete (and for closely related reasons, extremely slow for some
expressions), while regular expressions are isomorphic to FSMs (which are
always linear time in the length of their input).
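
To make the FSM half of that concrete, here is a minimal sketch of a table-driven matcher. The pattern `^a(b|c)*d$`, the state names, and `dfaMatches` are all made up for illustration; the point is only that each input character is looked at exactly once and nothing is ever backtracked.

```javascript
// Table-driven DFA equivalent to the regular expression ^a(b|c)*d$.
// States and transitions are invented for this illustration.
const transitions = {
  start: { a: "body" },
  body: { b: "body", c: "body", d: "done" },
  done: {},
};
const acceptStates = new Set(["done"]);

function dfaMatches(input) {
  let state = "start";
  for (const ch of input) {
    // One table lookup per character: no backtracking, so the work
    // grows linearly with the length of the input.
    const next = transitions[state][ch];
    if (next === undefined) return false; // no transition: reject
    state = next;
  }
  return acceptStates.has(state);
}

console.log(dfaMatches("abcbd")); // true
console.log(dfaMatches("abca"));  // false
```

A backtracking engine gives no such guarantee, which is where the slow cases for some expressions come from.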

------
kidsil
I never understood why so many developers have problems with Regex. In my
experience, it's difficult because most people avoid it; once you use it
enough times the logic makes perfect sense.

~~~
banthar
I do not know if it's correlation or causation, but almost all code using
regular expressions I encounter is horribly broken.

1\. Regular expressions are often used instead of existing parsers. XML, CSV,
file paths, URIs etc. all already have fast, well tested and correct parsers.

2\. Often the thing worked on should not be a string in the first place. For
example, a comma-separated string is used instead of a list and regular
expressions emulate list operations.

3\. Some things could be processed much more easily with a different tool, for
example a recursive descent parser. Yet the developer still tries to parse
arithmetic expressions with a hammer.

4\. They are often developed via trial and error. They are either the first
thing that worked for a simple case or 10-line monsters riddled with
exceptions to exceptions.

5\. They are often part of hacks and workarounds. For example User-Agent is
matched to work around bugs in browsers.

6\. They give very limited feedback to the user. There is either a match or no
match. There is no way to tell what is broken or where.

There are valid use cases for regular expressions. It's even possible to write
correct code with them. It's just a rare sight.
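
A small sketch of point 1, using the `URL` class that ships with browsers and Node.js. The address and the hand-rolled pattern are invented for the example; the pattern is the kind of "host is everything up to the next slash" guess that tends to show up in real code.

```javascript
// The address is invented for this example.
const address = "https://user@example.com:8080/a/b?x=1&y=two#frag";

// A typical hand-rolled pattern: "everything up to the next slash is the host".
const naiveHost = address.match(/^https?:\/\/([^\/]+)/)[1];

// The built-in URL parser already knows about userinfo, ports and queries.
const parsed = new URL(address);

console.log(naiveHost);                    // user@example.com:8080
console.log(parsed.hostname);              // example.com
console.log(parsed.port);                  // 8080
console.log(parsed.searchParams.get("y")); // two
```

The regex is not wrong about where the slash is; it just has no idea what the pieces in front of it mean.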

~~~
ivanhoe
Many of these points are correct, but a little comment on #1: Often you don't
care about the whole structure, you just need some small piece of data from
the middle of a huge document. One common example is a spider that collects
the price of some product from a 200+KB webpage. You just need those few
digits and don't care about the head or title or the structure of the DOM or
anything else. In such cases (and it's a very common task for people working
on data extraction) no complex parser can ever compete with a regexp in terms
of speed and memory footprint. And if you need to parse a few million
products, that performance gain is a huge deal. So don't underestimate the
power of regexps when properly used...

~~~
banthar
That is fine for throwaway scripts. But such "perl duct tape" is not 100%
accurate and will break for no reason. There is no place for such solutions in
reliable and maintainable software.

~~~
ivanhoe
Why would it "break for no reason"?! For all I know, a regExp matching one
small piece of the page is far less prone to breaking than a parser that has
to analyze the whole page. A designer changes one <div> or id/class somewhere
at the top of the DOM tree and you can't reach the node that you are looking
for anymore. The same goes for a regExp of course, but it's looking at a
smaller portion of the html, so it's less likely to be affected by small
changes in some unrelated part of the page. And any major redesign will break
any dedicated scraper, no matter which parser it uses...

~~~
banthar
Let's try an example. Extract the first link address from
[https://news.ycombinator.com/](https://news.ycombinator.com/).

As DOM query:

    
    
        document.getElementsByClassName("title")[0].parentElement.getElementsByTagName("a")[1].href
    

This will break:

* When title element no longer has "title" class.

* When title is no longer a sibling of link.

* When link is no longer 2nd link of its parent.

As regular expression:

    
    
        document.documentElement.innerHTML.match('td class="title">.*a href="([^"]*)"')[1]
    

This will break:

* On any white space change.

* On any new attributes on td or a.

* When ' is used instead of "

* When href includes escaped "

* In most cases when DOM query will break.

Many of those can happen without any server-side changes. It will sometimes
work and sometimes won't, making it hard to test.

There are cases when a regular expression will break less often than DOM, but
DOM is easier to reason about, more predictable and has fewer corner cases.
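
A few of the regex failure modes can be reproduced without a browser. This is only a sketch: the markup strings and the `firstTitleLink` helper are made up, and the pattern is the one from above written as a regex literal instead of a string.

```javascript
// Same pattern as in the comment above, written as a regex literal.
const linkPattern = /td class="title">.*a href="([^"]*)"/;

// Hypothetical helper: returns the captured address, or null on no match.
function firstTitleLink(html) {
  const match = html.match(linkPattern);
  return match ? match[1] : null;
}

// Three renderings of the same cell; a DOM parser treats them identically.
const original =
  '<td class="title"><a href="http://example.com/story">Story</a></td>';
const singleQuoted =
  "<td class='title'><a href='http://example.com/story'>Story</a></td>";
const extraAttribute =
  '<td class="title" align="left"><a href="http://example.com/story">Story</a></td>';

console.log(firstTitleLink(original));       // http://example.com/story
console.log(firstTitleLink(singleQuoted));   // null
console.log(firstTitleLink(extraAttribute)); // null
```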

------
jdeisenberg
Perl 6's regular expressions and grammars appear to be fairly powerful and
useful.

[http://doc.perl6.org/language/regexes](http://doc.perl6.org/language/regexes)
[http://doc.perl6.org/language/grammars](http://doc.perl6.org/language/grammars)

------
bane
> Maybe some use of XML

No. If you're even thinking about defining a language in XML you've probably
already screwed up some place.

------
orangeduck
Lua has an alternative to Regex which is an extension of the C patterns (used
in printf etc).

[http://lua-users.org/wiki/PatternsTutorial](http://lua-users.org/wiki/PatternsTutorial)

At first I found it a bit confusing but they're actually pretty great. For me
at least 90% of the tasks I want to do with regex can be done with scanf, and
Lua's little extension covers the remaining 10% quite well.

I know lots of people often say that it would be nicer to use functional
composition for regex instead of strings because strings are too confusing,
but I disagree with this. The confusion of regex to me is not from the string
representation, it is that some characters are "special" while others are
"normal" (including whitespace). At first it appears that most characters are
"normal" and so you can start from some example and generalize the string
until it matches all the things you want - but once you start putting
parentheses and such in you start to realize that most of the string won't be
matched "normally" and it is better to start thinking like a grammar and write
it from scratch. This double thinking is pretty annoying.

For this reason, to me the Lua patterns are really the only alternative I've
come across to regex that I've liked. They've got nice compact expressive
syntax, can really easily do most of the matching tasks I need due to the
scanf base, (almost) all the "special" characters begin with %, and the
complex cases can still be matched.

~~~
hamstergene
I read the tutorial you linked and I see no difference between Lua patterns
and regexes, except that '\' has been replaced with '%', and '*?' with '-'.
The only thing in common with scanf is adopting '%' as control character,
otherwise that's the same old regular expressions: '%d' matches just one
digit, not whole signed integer like in scanf.

------
MrPatan
To everybody hating on regexes: remember that the alternative to even a simple
regex is typically a poorly-implemented ad-hoc state machine dozens of lines
long full of nested loops and conditionals. Without gotos if you're lucky.

_That_ is a bug breeding ground.

Now, if what you are after are regular expressions with a different syntax....
well, maybe you're onto something here. But I would still call it a regex,
personally.

~~~
sklogic
No, an alternative to the write-only regexps is nice, clean, dense but yet
readable BNF or PEG syntax. And no, they're not regular expressions with a
different syntax, they're much higher up in the Chomsky hierarchy.
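
As a rough illustration of the grammar approach, here is a hand-written recursive descent evaluator for arithmetic, with the PEG-style grammar it follows in the comment. It is a sketch, not the output of any PEG tool; `evaluate` and the helper names are made up. Nested parentheses are exactly the part a true regular expression cannot handle.

```javascript
// Grammar (PEG-style), one function per rule:
//   expr   <- term (('+' / '-') term)*
//   term   <- factor (('*' / '/') factor)*
//   factor <- number / '(' expr ')'
function evaluate(source) {
  let pos = 0;

  function skipSpaces() {
    while (pos < source.length && source[pos] === " ") pos++;
  }

  function fail(expected) {
    // Unlike a failed regex match, the error says what was wanted and where.
    throw new SyntaxError(`expected ${expected} at position ${pos}`);
  }

  function parseNumber() {
    skipSpaces();
    const start = pos;
    while (pos < source.length && source[pos] >= "0" && source[pos] <= "9") pos++;
    if (pos === start) fail("a number");
    return Number(source.slice(start, pos));
  }

  function parseFactor() {
    skipSpaces();
    if (source[pos] === "(") {
      pos++; // consume "("
      const value = parseExpr();
      skipSpaces();
      if (source[pos] !== ")") fail("')'");
      pos++; // consume ")"
      return value;
    }
    return parseNumber();
  }

  function parseTerm() {
    let value = parseFactor();
    for (;;) {
      skipSpaces();
      const op = source[pos];
      if (op !== "*" && op !== "/") return value;
      pos++;
      const rhs = parseFactor();
      value = op === "*" ? value * rhs : value / rhs;
    }
  }

  function parseExpr() {
    let value = parseTerm();
    for (;;) {
      skipSpaces();
      const op = source[pos];
      if (op !== "+" && op !== "-") return value;
      pos++;
      const rhs = parseTerm();
      value = op === "+" ? value + rhs : value - rhs;
    }
  }

  const result = parseExpr();
  skipSpaces();
  if (pos !== source.length) fail("end of input");
  return result;
}

console.log(evaluate("2 * (3 + 4) - 5")); // 9
```

Calling `evaluate("(1 + 2")` throws a `SyntaxError` saying a closing parenthesis was expected at position 6, which is more than a plain match/no-match result gives you.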

------
avera
I have my own regex alternative, currently in development, with C-like
syntax. Here is an approximate sketch of how it looks. The goal is that from
this syntax I can get a full, hierarchical AST for parsing programming
languages: [http://pastebin.com/CFZf306p](http://pastebin.com/CFZf306p) \-
until it's done, some smaller details can change.

The idea is that it can build multiple layers of matched items. In this
example, layer 0 is "pattern lines"; on top of this, the token representation
gets mapped at "pattern tokens".

Lastly, over lines and tokens go AST objects, which are represented by
language code.

Hierarchy and context-based pattern referencing and construction can be
easily achieved.

Ideally, I would like to build this with a realtime update feature for use in
a source code editor. In an IDE, when I write some characters in code, these
changes propagate into the parsed layer objects and update what is needed.

------
alfiedotwtf
I'd hate to be writing a grammar every time I want to do a substring match.
Regular expressions are convenient, and far more compact.

~~~
adestefan
If you're doing a substring match, then use a substring function.
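
A tiny sketch of why that matters in practice. The strings are invented; `String.prototype.includes` is the built-in substring test.

```javascript
// Strings invented for this example.
const text = "Order 19x99 shipped";
const needle = "19.99";

// Plain substring search: the needle is taken literally.
const foundAsSubstring = text.includes(needle);

// The same needle used as a pattern: "." now means "any character".
const foundAsPattern = new RegExp(needle).test(text);

console.log(foundAsSubstring); // false
console.log(foundAsPattern);   // true, because "." matched the "x"
```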

------
bobbylox
One alternative in the Wolfram Language is called StringExpressions. For
instance, something like

StringReplace["Mad Hatter", "M"|"H"~~a_~~b_..:>"B"<>a<>"g"]

returns "Bag Bager"

The nice thing is that a string expression can use a RegularExpression as part
of the pattern, but it's often not necessary.

~~~
DannyBee
"The nice thing"

Nothing in this post is a nice thing :)

------
plorg
I rarely have trouble constructing or understanding RegExes, but I do have an
awful time getting quoting correct and remembering which special characters
have to be escaped in any one particular language.
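
For JavaScript at least, the usual workaround is a small helper that escapes every metacharacter in one go, so you don't have to remember the list. A sketch, with `escapeRegExp` as a hypothetical helper name and the sample text made up:

```javascript
// Hypothetical helper: backslash-escape every regex metacharacter so the
// input can be dropped into a pattern as literal text.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const sample = "1+1=2?";

// Unescaped, "+" and "?" act as quantifiers and the text fails to match itself.
const unescaped = new RegExp("^" + sample + "$").test(sample);

// Escaped, the pattern matches exactly the original text.
const escaped = new RegExp("^" + escapeRegExp(sample) + "$").test(sample);

console.log(unescaped); // false
console.log(escaped);   // true
```

This only covers the regex layer; the quoting rules of the surrounding string literal are still a separate problem.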

~~~
raiph
In Perl 6 the regex character escaping rule is:

> Alphanumeric characters and the underscore _ are literal matches. All other
> characters must either be escaped with a backslash (for example \: to match
> a colon), or included in quotes.

From
[http://doc.perl6.org/language/regexes](http://doc.perl6.org/language/regexes)

------
gwu78
I go from BRE to flex to spitbol depending on the situation. Never had a need
for PCRE. Snobol can match anything.

k/q also has pattern matching.

Personally, I get more mileage out of BRE than anything else. Simple and
effective.

------
reirob
Wouldn't Marpa [1] be an alternative to REs?

[1]: [http://marpa-guide.github.io/chapter1.html](http://marpa-guide.github.io/chapter1.html)

------
kazinator
[http://www.nongnu.org/txr](http://www.nongnu.org/txr)

------
tofof
I stopped reading at: "Besides, grammars are more complex than regular
expressions, so they're simpler."

~~~
NaNaN
You are right. The article introduces those ? + * in the grammars again,
though grammars are actually simpler in some respects.

------
thekaleb
Problems arise with regular expressions when they are used against irregular
grammars.
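
The textbook case is nesting. Balanced parentheses are not a regular language, because the matcher has to count, and a finite state machine has no counter. A few lines of ordinary code do it; `isBalanced` is just an illustrative name.

```javascript
// Checks that every "(" has a matching ")" in the right order.
// The depth counter is the piece of state a finite automaton lacks.
function isBalanced(text) {
  let depth = 0;
  for (const ch of text) {
    if (ch === "(") {
      depth++;
    } else if (ch === ")") {
      depth--;
      if (depth < 0) return false; // closed more than was opened
    }
  }
  return depth === 0;
}

console.log(isBalanced("(a (b) (c (d)))")); // true
console.log(isBalanced("(a (b)"));          // false
console.log(isBalanced(")("));              // false
```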

------
sklogic
I'm almost always using PEG where regexps are typically applied.

~~~
raiph
Perl 6 unifies "regexes", PEGs, and lexical closures.[1]

[1]
[https://en.wikipedia.org/wiki/Perl_6_rules](https://en.wikipedia.org/wiki/Perl_6_rules)

