
Regexes vs. Combinatorial Parsing (2019) - pcr910303
https://khanlou.com/2019/12/regex-vs-combinatorial-parsing/
======
unhammer
> This is really the only way to get composability with regexes

If you restrict yourself to "real" regular expressions (avoid Perl extensions
and such), they express regular languages[1], which do compose very well, and
can be compiled into finite state machines.

Regular languages are closed not only under concatenation and Kleene star, but
also:

    
    
        - union
        - intersection
        - complement
        - reverse
    

and more.[1]

Real FST libraries like HFST or OpenFST let you compile and compose regular
expressions this way. It's too bad most "standard" regex libraries stop at
`re.compile` and never implement concat/intersect/complement/reverse/union.
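To make the closure-under-intersection point concrete, here's a toy sketch in Python (hypothetical code, not HFST/OpenFST) of the product construction: the intersection DFA simply runs both machines in lockstep on paired states.

```python
from itertools import product

# A DFA as (states, start, accepting, delta) with delta[(state, ch)] -> state.
# Toy machines: even_a accepts strings over {a,b} with an even number of a's,
# ends_b accepts strings ending in 'b'.

even_a = ({'e', 'o'}, 'e', {'e'},
          {('e', 'a'): 'o', ('o', 'a'): 'e',
           ('e', 'b'): 'e', ('o', 'b'): 'o'})

ends_b = ({'n', 'y'}, 'n', {'y'},
          {('n', 'a'): 'n', ('y', 'a'): 'n',
           ('n', 'b'): 'y', ('y', 'b'): 'y'})

def intersect(d1, d2, alphabet):
    """Product construction: run both machines in lockstep on paired states."""
    q1, s1, f1, t1 = d1
    q2, s2, f2, t2 = d2
    states = set(product(q1, q2))
    delta = {((a, b), ch): (t1[(a, ch)], t2[(b, ch)])
             for (a, b) in states for ch in alphabet}
    return states, (s1, s2), set(product(f1, f2)), delta

def accepts(dfa, text):
    _, state, accepting, delta = dfa
    for ch in text:
        state = delta[(state, ch)]
    return state in accepting

both = intersect(even_a, ends_b, 'ab')
print(accepts(both, 'aab'), accepts(both, 'ab'))  # True False
```

Union, complement, and reverse are similar one-screen constructions, which is why it's a shame the mainstream libraries don't expose them.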

-----

For the readability issue, give rx[2] a go – you can turn plain regexes into
rx syntax with xr[3] :)

[1]
[https://en.wikipedia.org/wiki/Regular_language#Closure_prope...](https://en.wikipedia.org/wiki/Regular_language#Closure_properties)

[2] [https://www.emacswiki.org/emacs/rx](https://www.emacswiki.org/emacs/rx)

[3] [https://github.com/mattiase/xr](https://github.com/mattiase/xr)

~~~
beeforpork
> ... and can be compiled into finite state machines. ...

I am not a big fan of this argument without a 'but'. It is usually overlooked
that the finite state machine can become exponentially large (= totally
unusable), because each deterministic state corresponds to a set of
non-deterministic states. This caused me major trouble in a project where the
effect was not noticed during the design phase.

I find this important to stress because the finite state machine can usually
be constructed without problems (e.g. by the flex tool), so this property of
regexps is undoubtedly useful.

But sometimes, it may explode on you.

~~~
a1369209993
That's _deterministic_ finite state machines (DFA); the nondeterministic (NFA)
representation is linear in regex length (unless you use "a{99}", but that's
nonstandard for a reason). You can also do concat/intersect/etc. directly on
the (actually-regular) regex representation, although it _is_ a bit ugly.
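The blowup the parent hit is easy to see with the classic example: the language "the k-th symbol from the end is 'a'" has an NFA with k+1 states, but every DFA for it needs 2^k states, since it must remember the last k symbols. A toy subset construction (a hypothetical sketch, not flex) makes the growth visible:

```python
def nfa_for(k):
    # State 0 loops on a/b; reading 'a' also spawns state 1; spawned states
    # then advance one step per symbol; state k is accepting (no outgoing).
    delta = {(0, 'a'): {0, 1}, (0, 'b'): {0}}
    for i in range(1, k):
        for ch in 'ab':
            delta[(i, ch)] = {i + 1}
    return delta

def count_dfa_states(k):
    # Subset construction: each DFA state is a set of NFA states.
    delta = nfa_for(k)
    start = frozenset({0})
    seen, todo = {start}, [start]
    while todo:
        current = todo.pop()
        for ch in 'ab':
            nxt = frozenset(s for state in current
                            for s in delta.get((state, ch), ()))
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return len(seen)

for k in (4, 8, 12):
    print(k, count_dfa_states(k))  # 16, 256, 4096: 2**k reachable subsets
```

The NFA stays at k+1 states while the determinized machine doubles with every increment of k.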

------
ben509
So regexes tend to be inscrutable and uncomposable largely because of legacy,
and partly because your operators naturally clash with the characters you're
trying to match.

When Larry Wall was designing Raku, née Perl 6, he started by throwing out the
POSIX baggage[1] to make the new regexes more readable. I think the common
cases are far more readable, though you can judge for yourself.

It definitely improves composing regexes by providing a sane, built-in syntax
for it.[2]

And since you're reading this, Raku also offers a full parser through its
grammars.[3]

[1]:
[https://design.raku.org/S05.html#Simplified_lexical_parsing_...](https://design.raku.org/S05.html#Simplified_lexical_parsing_of_patterns)

[2]:
[https://docs.raku.org/language/regexes#Regex_interpolation](https://docs.raku.org/language/regexes#Regex_interpolation)

[3]:
[https://docs.raku.org/language/grammars.html](https://docs.raku.org/language/grammars.html)

~~~
thesz
What Wall did is the creation of a language feature. The original post talks
about a library.

Libraries in well-typed languages have about the same power as language
features, and they are within reach of an ordinary programmer. If you need a
library, you can write it, use it, and publish it. If you need a (change of a)
language feature, you have to convince others to write it for you.

~~~
bmn__
That claim is false; I only need one counter-example to disprove it. In Perl,
regex is a language feature, yet it can easily be supplanted with a library
that anyone can use and publish:
[https://metacpan.org/search?q=re%3A%3Aengine%3A%3A](https://metacpan.org/search?q=re%3A%3Aengine%3A%3A)
It is not necessary to convince the Perl maintainers to write that for you.

~~~
thesz
From my point of view, you disproved nothing.

Can you intersperse regexps-as-a-language-feature with regexps-as-a-library?
If the answer is "yes", then I think you have actually proven my point: what
looks like a language feature is really a library in disguise, a special kind
of library, so to say. If the answer is "no", then whenever you miss something
in the language feature (the ability to plug in arbitrary parsing from these
libraries, for example), you need to convince the language maintainers to work
for you, or at least to review your work and your proposal.

------
brigandish
Some quick thoughts on the first two problems listed:

Inscrutability - use the `x` modifier to allow whitespace and comments, e.g.

    
    
        /^(\^+|_+|=)?([a-gA-G])(,+|'+)?(\d*)(\/)?(\d*)$/
    

becomes:

    
    
        /^
          ( \^+ | _+ | = )?  # sharps and flats
          ( [a-gA-G] )       # notes
          ( ,+ | '+ )?
          ( \d* )            # duration
          ( \/ )?            # fractions
          ( \d* )            # more duration? I'd need to read the spec
        $/x
    

and use different delimiters to remove escaping of slashes:

    
    
        <^
          ( \^+ | _+ | = )?  # sharps and flats
          ( [a-gA-G] )       # notes
          ( ,+ | '+ )?
          ( \d* )            # duration
          ( / )?             # fractions
          ( \d* )            # more duration? I'd need to read the spec
        $>x
    

That's already much more readable, but then we have complaint number 2,
composability. If your language supports it, use interpolation:

    
    
        SHARPS_AND_FLATS = /\^+ | _+ | =/x
        NOTES = /[a-g]/i

        <^
          (#{SHARPS_AND_FLATS})?  # sharps and flats
          (#{NOTES})              # notes
          ( ,+ | '+ )?
          ( \d* )                 # duration
          ( / )?                  # fractions
          ( \d* )                 # more duration? I'd need to read the spec
        $>x
    

You can see how it could go on. I'm not saying regexes are always the right
solution, and not all languages have access to the same tools, but there's
often a lot that can be done to mitigate the awfulness (and a lot more than
shown here: lookarounds, named captures, etc.).

The `x` modifier is the #1 fix. I never write anything above the most simple
regex without it, and for the life of me I can't understand why it's hardly
ever used.
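For what it's worth, the same trick carries over to Python, where the flag is spelled `re.VERBOSE` (alias `re.X`); here's a rough port of the pattern above (my sketch, group meanings guessed from the comments):

```python
import re

# re.VERBOSE ignores unescaped whitespace in the pattern and allows
# '#' comments, exactly like the x modifier.
NOTE = re.compile(r"""
    ^
    ( \^+ | _+ | = )?   # sharps and flats
    ( [a-gA-G] )        # note name
    ( ,+ | '+ )?        # octave marks
    ( \d* )             # duration
    ( / )?              # fraction separator
    ( \d* )             # denominator
    $
""", re.VERBOSE)

m = NOTE.match("^c'2/3")
print(m.groups())  # ('^', 'c', "'", '2', '/', '3')
```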

~~~
wodenokoto
Do you have a tutorial or article you'd recommend to read up on the `x` modifier?

~~~
brigandish
My favourite regex site is Rexegg[1]. I thought it had died so I was using
archive.org but it appears to be back. Unmatched, in my opinion.

Still, there's not much to the `x` modifier: you just need to know that it
will ignore unescaped whitespace and comments (`# blah blah`), so you must use
`\s` or `\t` etc. when you really want to match some whitespace. Otherwise you
can do things like:

    
    
        "Hello, world!".split(/ [ \s , ]+ # this splits on at least one space or comma /x)
    

Which is overkill for a simple pattern (then again, why not?) but unbelievably
useful for a complex one. Add spaces and newlines until it's readable.

It's a small change but a big effect.

[1] [https://www.rexegg.com/regex-modifiers.html](https://www.rexegg.com/regex-modifiers.html)

------
jakear
One approach I like for this sort of thing is Pratt parsing. A full Pratt
parsing "library" can be implemented with barely any work (hour or two), and
from there you get quick and easy expression parsing with full support for
precedence levels, left/right associativity, infix operations, etc.

Doc I read for an introduction to the area:
[http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-e...](http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/)

Quick little TS implementation:
[https://github.com/JacksonKearl/PrattParse](https://github.com/JacksonKearl/PrattParse)

Above used in a cute little D&D dice rolling discord app (basically
implementing a language with random variables as literals & first-class
values): [https://github.com/JacksonKearl/RollDice-Discord/](https://github.com/JacksonKearl/RollDice-Discord/)
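As a taste of how little machinery is needed, here's a hypothetical minimal sketch in Python (not the TS library above): binding powers drive precedence, and the `min_bp` cutoff handles associativity.

```python
import re

def tokenize(s):
    # Numbers, the four operators, and parentheses.
    return re.findall(r"\d+|[+\-*/()]", s)

BP = {'+': 10, '-': 10, '*': 20, '/': 20}  # binding powers

def parse(tokens, min_bp=0):
    """Parse an expression whose operators bind tighter than min_bp."""
    tok = tokens.pop(0)
    if tok == '(':
        lhs = parse(tokens, 0)
        tokens.pop(0)  # consume ')'
    else:
        lhs = int(tok)
    # Keep absorbing operators as long as they bind tighter than our caller.
    while tokens and tokens[0] in BP and BP[tokens[0]] > min_bp:
        op = tokens.pop(0)
        rhs = parse(tokens, BP[op])  # strict '>' above makes ops left-assoc
        lhs = (op, lhs, rhs)
    return lhs

print(parse(tokenize("1 + 2 * 3")))  # ('+', 1, ('*', 2, 3))
```

Right-associativity would just mean comparing with `>=` (or passing `BP[op] - 1`) for that operator.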

~~~
wwee
I found this article[1] to be a great introduction to Pratt parsing as well;
it even shows how it's actually the shunting-yard algorithm in disguise[2].

[1] [https://matklad.github.io/2020/04/13/simple-but-powerful-pra...](https://matklad.github.io/2020/04/13/simple-but-powerful-pratt-parsing.html)

[2] [https://matklad.github.io/2020/04/15/from-pratt-to-dijkstra....](https://matklad.github.io/2020/04/15/from-pratt-to-dijkstra.html)

------
FridgeSeal
I actually love parser-combinators.

Something about writing them feels very 'correct', and they're often the first
thing I reach for when I've got to parse something and there isn't a
pre-existing library.

It's one of the tasks Haskell feels especially well suited for: the function
composition tools and the syntax make reading and writing them especially
straightforward. Writing parser combinators in other languages just feels
clunky.

~~~
ben509
They are honestly fun to write, but then you're doing recursive descent.
Having written a significant one, I felt the elegance of the concept was
marred by all the workarounds creeping into the code.

------
yen223
Parser combinators allow one to take the large problem of parsing a language,
decompose it into smaller, solvable problems of applying grammar rules, and
then compose the solutions together via combinators. This is a simple-yet-
powerful approach that I wish were more prevalent in other areas of software
engineering. Most tools that we know and love (e.g. regex) aren't composable
enough.
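For readers who haven't seen the idea, a hypothetical minimal sketch in Python: a parser is just a function from text to `(value, rest)` on success or `None` on failure, and the combinators build big parsers out of small ones.

```python
def char(c):
    """Parser for a single literal character."""
    def parse(text):
        if text.startswith(c):
            return c, text[1:]
        return None
    return parse

def seq(*parsers):
    """Run parsers in order; succeed only if all succeed."""
    def parse(text):
        values = []
        for p in parsers:
            result = p(text)
            if result is None:
                return None
            value, text = result
            values.append(value)
        return values, text
    return parse

def alt(*parsers):
    """Return the result of the first parser that succeeds."""
    def parse(text):
        for p in parsers:
            result = p(text)
            if result is not None:
                return result
        return None
    return parse

def many(p):
    """Zero or more repetitions of p."""
    def parse(text):
        values = []
        while (result := p(text)) is not None:
            value, text = result
            values.append(value)
        return values, text
    return parse

# Small solvable pieces composed into a bigger parser:
digit = alt(*[char(d) for d in "0123456789"])
number = many(digit)
print(number("42abc"))  # (['4', '2'], 'abc')
```

Real libraries add error reporting, backtracking control, and result mapping, but the compositional core really is this small.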

------
taeric
The readability is something you can get with a simple builder over regexes.
Emacs has rx, as an example.

Getting back a parse tree, though, is a huge step. I've always been surprised
that there isn't a BNF language that lets you write the grammar and get back
an instance of it from a string. Yacc, I suppose, is what I'm asking for.
Feels off, though.

------
eesmith
The ABC notation reminds me of the SMX notation I used on the IBM PC BASICA
back in the 1980s.
[https://en.wikipedia.org/wiki/Music_Macro_Language#SMX](https://en.wikipedia.org/wiki/Music_Macro_Language#SMX)
. I used to have a lot of fun encoding simple songs that way - and debugging
my songs.

I haven't used it since then. A quick comparison now shows the two formats
have different heritages. Still, the flashback was nice.

------
mimixco
This has been my experience, too. We needed to parse a complex DSL in Lisp,
and I was hoping we wouldn't have to use RegEx, for all the reasons you
mentioned. The parser combinator library we used (MaxPC) brought with it many
advantages: readability and small individual parsers were at the top of the
list.

There was a learning curve in switching to thinking in terms of parser
combinators but, once we had tried every single construct in MaxPC and done
everything wrong that was possible (!), we learned how to really make the
system fly. Now I feel comfortable that we could write a parser for anything
with these ideas.

And your music notation parser sounds cool.

------
b2gills
It could be clearer if you used Raku regexes:

    
    
        my $pattern = /:sigspace
          ^
          score  $<points>     = (\d+)
          for    $<nights>     = (\d+)  nights?
          at     $<hotel-name> = (.*)
          $
        /
    
        if 'score 400 for 2 nights at Minas Tirith Airport' ~~ $pattern {
          say "the score was $<points>";
          say "it occurred over $<nights>";
          say "at $<hotel-name>";
        }
    

Or if you really wanted to be ambitious, a grammar

    
    
        grammar Pattern {
          token points {\d+}
          rule  nights {<(\d+)> nights?}
          token hotel-name {.*}
          rule TOP {
            score <points> for <nights> at <hotel-name>
          }
        }
    
        if Pattern.parse('score 400 for 2 nights at Minas Tirith Airport') {
          say "the score was $<points>";
          say "it occurred over $<nights>";
          say "at $<hotel-name>";
        }
    

Note that only the `s` is made optional by the `?` in `nights?`. To make the
whole word optional would require grouping it in some manner: `[nights]?` or
even `"nights"?`
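For comparison, a rough Python equivalent (my sketch, not from the thread) using named groups and `re.VERBOSE` gets part of the way there, though without the grammar reuse:

```python
import re

# Named groups play the role of Raku's $<points> etc.; re.VERBOSE lets the
# pattern breathe, but whitespace between words must be matched explicitly.
pattern = re.compile(r"""
    ^ score \s+ (?P<points>\d+)
      \s+ for \s+ (?P<nights>\d+) \s+ nights?
      \s+ at \s+ (?P<hotel_name>.*) $
""", re.VERBOSE)

m = pattern.match('score 400 for 2 nights at Minas Tirith Airport')
print(m.group('points', 'nights', 'hotel_name'))
# ('400', '2', 'Minas Tirith Airport')
```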

------
bjoli
Scheme's irregex addresses a lot of these problems. It has become my go-to
tool for regular expressions in Scheme.

[https://srfi.schemers.org/srfi-115/srfi-115.html](https://srfi.schemers.org/srfi-115/srfi-115.html)

------
Xophmeister
I really like playing around with combinatorial parsers. I feel like it's
almost a natural development in any software engineer's life:

* Very Early Days: Basic string manipulation.

* Early Days: Regular expressions; perhaps including misguided efforts to parse HTML/e-mail addresses, only abandoned due to what seems like never ending complexity, rather than realising it's not possible.

* Not Long After: Overusing/abusing regular expressions; perhaps with the understanding that they can't solve all parsing problems, but will get you most/some of the way there; you know about parsers, but they feel too mystical ("something that a compiler does").

* Pre-Enlightenment: Writing very simple, recursive descent parsers; perhaps reading up/playing with things like Yacc, Bison, ANTLR, etc.

* Enlightenment: Abstracting your earlier recursive descent parsers with combinator primitives that maybe mimic the quantifiers used in regular expressions. Now you can parse (almost?) anything! (n.b., This step often only happens if you've first delved into functional programming.)

* Post-Enlightenment: Overusing/abusing parser combinators -- DSL all the things!! -- but all the while, perhaps realising their limits/complexity class.

I'm being somewhat facetious, rather than giving myself a backhanded
compliment, but you get the idea :) The more powerful tools you discover, the
more you want to use them until you inevitably outgrow them. Besides realising
that "lesser" tools have their place, I'm curious what happens next...

~~~
vanderZwan
> * Post-Enlightenment: Overusing/abusing parser combinators -- DSL all the
> things!! -- but all the while, perhaps realising their limits/complexity
> class.

Have you looked at the Rosie Pattern Language? I think it's quite an
interesting attempt at a more "high-level" modern pattern language. It's more
human-readable, and it tries to make adding tests as easy as possible (given
how monstrous regexps can get, I think the importance of that shouldn't be
underestimated).

[https://rosie-lang.org/blog/2019/07/18/practicalpegs.html](https://rosie-lang.org/blog/2019/07/18/practicalpegs.html)

------
amake
I had a great experience using PetitParser
([https://github.com/petitparser](https://github.com/petitparser)) in Dart,
though it turned out my use case was a bit hairy for combinatorial parsing: I
made a parser for org-mode ([https://orgmode.org/](https://orgmode.org/))
markup, which is similar to Markdown.

Markdown-like formats are quite line-oriented and sometimes context-sensitive
in ways that make rules hard to express.

I did get something usable, but I had to make some new primitives and break
some efficiency rules (copying substrings).

The result is here:
[https://github.com/amake/org_parser](https://github.com/amake/org_parser)

(Shameless plug) It’s used in my FOSS mobile org-mode viewer app Orgro:
[https://orgro.org](https://orgro.org)

------
polote
> In order to compose these regexes together, we have to break them apart
> first, and then apply some string concatenation

I have been fighting with that while parsing sentences which contain
locations, and yeah, there is no easy solution. My final way of doing it is a
combination of:

- string variable concatenation

- regex multi-line mode

- comments inside the regex

------
flowcont
I discovered parser combinators a few years ago in Scala with parboiled[1] and
fell in love with them. Although I haven't used them lately, I still have to
look into Fastparse[2], and I'm sure I will have fun going through this
article on parser combinators in Rust[3].

[1]
[https://github.com/sirthias/parboiled2](https://github.com/sirthias/parboiled2)

[2]
[https://github.com/lihaoyi/fastparse](https://github.com/lihaoyi/fastparse)

[3] [https://bodil.lol/parser-combinators/](https://bodil.lol/parser-combinators/)

------
zelphirkalt
And the next level would be to create a grammar for what one wants to parse,
which would make it more declarative and even better.

The problem with parser combinators is not as big as with regular expressions,
but they still have a problem once the format suddenly changes: you might need
to change your code, possibly a lot. Using grammars, you can change only the
declarative grammar and the program stays the same. A grammar can also be used
from other programming languages, as long as they support the same grammar
syntax (i.e. have a parser for that syntax, ha!).

------
hansvm
Parser combinators are fantastic tools, and in many languages there isn't a
need to roll your own like the author did. I've enjoyed the following in
Python and C#:
[https://github.com/sighingnow/parsec.py](https://github.com/sighingnow/parsec.py)
[https://www.nuget.org/packages/LanguageExt.Parsec/3.4.15](https://www.nuget.org/packages/LanguageExt.Parsec/3.4.15)

------
slashdotdash
Elixir has the NimbleParsec library, which generates parsing code that takes
advantage of pattern matching for high performance and low memory usage,
thanks to BEAM virtual machine optimisations.

[https://github.com/dashbitco/nimble_parsec](https://github.com/dashbitco/nimble_parsec)

------
oracular_demon
I've had a good experience with this approach recently for a project where I
wanted to extract the components of a recipe ingredient:
[https://github.com/JedS6391/RecipeIngredientParser](https://github.com/JedS6391/RecipeIngredientParser)

------
jariel
Brilliant. Having not thought about the subject deeply, I'd say this article
is an intellectual rendering of my otherwise subconscious frustration.

Now we just need a better solution.

