Regexes vs. Combinatorial Parsing (2019) (khanlou.com)
154 points by pcr910303 70 days ago | hide | past | favorite | 45 comments

> This is really the only way to get composability with regexes

If you restrict yourself to "real" regular expressions (avoid Perl extensions and such), they express regular languages[1], which do compose very well, and can be compiled into finite state machines.

Regular languages are closed not only under concatenation and Kleene star, but also:

    - union
    - intersection
    - complement
    - reverse
and more.[1]

Real FST libraries like HFST or OpenFST let you compile and compose regular expressions this way. It's too bad most "standard" regex libraries stop at "re.compile" and never implement concat/intersect/complement/reverse/union.
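Closure under intersection, for instance, falls out of the product construction: run two DFAs in lockstep and accept when both accept. A toy sketch in Python (the DFA encoding here is hypothetical, not any real library's API):

```python
# A DFA is (start_state, accept_states, transitions), with transitions
# as a dict keyed by (state, symbol).

def intersect(d1, d2):
    """Product construction: states of the result are pairs of states."""
    s1, f1, t1 = d1
    s2, f2, t2 = d2
    def delta(state, sym):
        a, b = state
        return (t1[(a, sym)], t2[(b, sym)])
    def accept(state):
        return state[0] in f1 and state[1] in f2
    return (s1, s2), accept, delta

def run(dfa, word):
    start, accept, delta = dfa
    state = start
    for sym in word:
        state = delta(state, sym)
    return accept(state)

# Language 1: even number of a's.
even_a = (0, {0}, {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1})
# Language 2: ends in b.
ends_b = (0, {1}, {(0, "a"): 0, (1, "a"): 0, (0, "b"): 1, (1, "b"): 1})

both = intersect(even_a, ends_b)
print(run(both, "aab"))  # True: two a's, ends in b
print(run(both, "ab"))   # False: odd number of a's
```

The same pairing trick gives union (accept when either accepts), and complement is just flipping the accept set, which is why "real" regular expressions compose so well.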


For the readability issue, give rx[2] a go – you can turn plain regexes into rx syntax with xr[3] :)

[1] https://en.wikipedia.org/wiki/Regular_language#Closure_prope...

[2] https://www.emacswiki.org/emacs/rx

[3] https://github.com/mattiase/xr

> ... and can be compiled into finite state machines. ...

I am not a big fan of this argument without a 'but'. It is usually overlooked that the finite state machine can become exponentially large (= totally unusable), because each deterministic state corresponds to a set of nondeterministic states. This caused me major trouble in a project where this effect was not noticed during the design phase.

I find this important to stress because the finite state machine can usually be constructed without problems (e.g. by the flex tool), so this property of regexps is undoubtedly useful.

But sometimes, it may explode on you.
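The classic example of this blowup is the language "the n-th symbol from the end is an 'a'": its NFA has n+1 states, but the subset construction produces 2^n deterministic states. A self-contained sketch that counts them (illustrative only, no regex library involved):

```python
def dfa_states(n):
    """Count reachable DFA states (via subset construction) for the NFA
    accepting strings whose n-th symbol from the end is 'a'.
    NFA: state 0 loops on a/b and may jump to 1 on 'a'; i -> i+1 on a/b."""
    def step(states, sym):
        out = set()
        for s in states:
            if s == 0:
                out.add(0)
                if sym == "a":
                    out.add(1)
            elif s < n:
                out.add(s + 1)
        return frozenset(out)

    start = frozenset({0})
    seen, frontier = {start}, [start]
    while frontier:
        cur = frontier.pop()
        for sym in "ab":
            nxt = step(cur, sym)
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)

for n in (3, 6, 9):
    print(n, dfa_states(n))  # 8, 64, 512: grows as 2**n
```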

That's deterministic finite automata (DFAs); the nondeterministic (NFA) representation is linear in regex length (unless you use "a{99}", but that's nonstandard for a reason). You can also do concat/intersect/etc. directly on the (actually-regular) regex representation, although it is a bit ugly.

Unfortunately, this is not the case as soon as you add grouping! To properly express regular expressions with grouping, finite state automata are not sufficient: you need the theory of transducers, which does not admit the same properties (in particular regarding determinisation).

What do you mean by 'grouping'?

Capture groups, I think.

> Regular languages are closed not only under concatenation and Kleene star, but also:

> - union

Union doesn't belong in the bottom list. Disjunction is one of the regular operations, just like concatenation and Kleene star.

> > Regular languages are closed not only under concatenation and Kleene star, but also:

> > - union

> Union doesn't belong in the bottom list. Disjunction is one of the regular operations, just like concatenation and Kleene star.

That seems to be what your parent said. Did you perhaps read "Regular languages are closed not only under …" as "Regular languages are not closed under"?

The parent comment says "closed not only under concatenation and Kleene star, but also under these other four operations". That divides the closure properties into two sets, obvious and nonobvious. The obvious ones are obvious because they are regular operations -- concatenation and Kleene star are operators defined by the language of regular expressions. Intersection isn't.

I'm pointing out that set union belongs in the obvious group and not the nonobvious group. Like concatenation and Kleene star, it is one of the operators used to define what a regular expression is.

To show that regular languages are closed under intersection / complementation / string reversal, you need to do a proof. The proof of closure under set union is just that it's part of the definition of regular expressions.

So regexes tend to be inscrutable and uncomposable largely because of legacy, and partly because your operators naturally clash with the characters you're trying to match.

When Larry Wall was designing Raku, née Perl 6, he started by throwing out the POSIX baggage[1] to make the new regexes more readable. I think the common cases are far more readable, though you can judge for yourself.

It definitely improves composing regexes by providing a sane, built-in syntax for it.[2]

And since you're reading this, Raku also offers a full parser through its grammars.[3]

[1]: https://design.raku.org/S05.html#Simplified_lexical_parsing_...

[2]: https://docs.raku.org/language/regexes#Regex_interpolation

[3]: https://docs.raku.org/language/grammars.html

What Pike did was create a language feature. The original post talks about a library.

Libraries in well-typed languages are of about the same power as language features, and they are within reach of the ordinary programmer. If you need a library, you can write it, use it, and publish it. If you need a (change of a) language feature, you have to convince others to write it for you.

Even in the OP, the author is already using NSRegularExpression, because Swift doesn't have regular expressions built in.

And if NeXT could write NSRegularExpression, anyone can write their own library that uses a better syntax and supports interpolation.

There's also no reason why a pure library has to make a single string its interface. For instance, in Python I might use overloaded operators:

    ((R('foo') | 'bar') + 'tholeme' + {'w', 'W'})[1:7]
to mean:

    ((foo|bar)tholeme[wW]){1,7}

There are some rough edges because you're abusing operator overloading, for sure, but expressing a regex is fundamentally composing assertions with operators. It doesn't _require_ any special syntax.
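A minimal sketch of what such an `R` class could look like (hypothetical, not a real library; the operator mapping, e.g. slicing for repetition, is just one possible design):

```python
import re

class R:
    """Regex builder via operator overloading: | is alternation,
    + is concatenation, a set is a character class, r[m:n] is {m,n}."""
    def __init__(self, pattern):
        self.pattern = pattern

    @staticmethod
    def _coerce(x):
        if isinstance(x, R):
            return x
        if isinstance(x, set):           # set -> character class
            return R("[%s]" % "".join(sorted(x)))
        return R(re.escape(x))           # plain string -> literal

    def __add__(self, other):            # concatenation
        return R(self.pattern + R._coerce(other).pattern)

    def __or__(self, other):             # alternation
        return R("(?:%s|%s)" % (self.pattern, R._coerce(other).pattern))

    def __getitem__(self, s):            # repetition via slicing
        return R("(?:%s){%d,%d}" % (self.pattern, s.start, s.stop))

    def match(self, text):
        return re.fullmatch(self.pattern, text)

p = ((R('foo') | 'bar') + 'tholeme' + {'w', 'W'})[1:7]
print(p.pattern)  # (?:(?:foo|bar)tholeme[Ww]){1,7}
```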

And some languages give you a lot of power to write your own DSLs that will feel more native, e.g. Scheme[1], Haskell[2], Ruby and so forth.

[1]: https://docs.racket-lang.org/guide/macros.html

[2]: https://wiki.haskell.org/Template_Haskell

That claim is false; I just need to bring one counter-example to disprove it. In Perl, regex is a language feature, yet it can easily be supplanted with a library that anyone can use and publish: https://metacpan.org/search?q=re%3A%3Aengine%3A%3A It is not necessary to convince the Perl maintainers to write that for you.

You, from my point of view, have disproved nothing.

Can you intersperse regexps-as-a-language-feature with regexps-as-a-library? If the answer is "yes", then I think you have actually proven my point, and what is a language feature is actually a library in disguise; a special kind of library, so to say. If the answer is "no", then if you miss something in the language feature (the ability to add arbitrary parsing from these libraries, for example), you need to convince the language maintainers to work for you, or at least to review your work and your proposal.

Some quick thoughts on the first two problems listed:

Inscrutability: use the `x` modifier, which allows whitespace and comments, e.g.


      ( \^+ | _+ | = )?  # sharps and flats
      ( [a-gA-G] )       # notes
      ( ,+ | '+ )?
      ( \d* )            # duration
      ( \/ )?            # fractions
      ( \d* )            # more duration? I'd need to read the spec
and use different delimiters to remove escaping of slashes:

      ( \^+ | _+ | = )?  # sharps and flats
      ( [a-gA-G] )       # notes
      ( ,+ | '+ )?
      ( \d* )            # duration
      ( / )?             # fractions
      ( \d* )            # more duration? I'd need to read the spec
That's already much more readable, but then we have complaint number 2, composability. If your language supports it then use interpolation:

    SHARPS_AND_FLATS = /\^+ | _+ | =/x
    NOTES = /[a-g]/i

      (#{SHARPS_AND_FLATS})?  # sharps and flats
      (#{NOTES})              # notes
      ( ,+ | '+ )?
      ( \d* )                 # duration
      ( / )?                  # fractions
      ( \d* )                 # more duration? I'd need to read the spec
You see how it could go on. I'm not saying regexes are always the right solution, and not all languages have access to the same tools, but there's often a lot that can be done to mitigate the awfulness (and a lot more here: lookarounds, named captures, etc.).

The `x` modifier is the #1 fix; I never write anything beyond the simplest regex without it, and for the life of me I can't understand why it's hardly ever used.

Also, named capture groups. Often I find my code much more readable when I name different fragments of the regex and then refer to those names later on in my code.

Yes, that would be my next top tip too.

    SHARPS_AND_FLATS = /\^+ | _+ | =/x
    NOTES = /[a-g]/i

    p = /
      (?<pitch>#{SHARPS_AND_FLATS})?  # sharps and flats
      (?<notes>#{NOTES})              # notes
      ( ,+ | '+ )?
      ( \d* )                 # duration
      ( \/ )?                 # fractions
      ( \d* )                 # more duration? I'd need to read the spec
    /x

    match_data = p.match(music)
    match_data[:pitch] # => "^"
    match_data[:notes] # => "A"
Much more readable.

I got a lot of mileage out of the x modifier, too, until I bumped into the Composed Regex pattern. https://www.martinfowler.com/bliki/ComposedRegex.html

Nice explanation (as expected from Fowler). Yep, mix that with a bit of interpolation and an iterator, maybe wrapped in an object and you have something readable and flexible. Not sure why people want to make regex any harder than they need to be?

In Perl's "apocalypses" on pattern matching[0] (i.e., the Perl 6 regex design documents), Larry Wall immediately states that /x should be the default (and not be an option at all!) for regexes.

It's incredible how useful it is and yet by not being the default every regex implementation has nudged their users to write inscrutable regular expressions for decades.

[0] https://raku.org/archive/doc/design/apo/A05.html

I must give Perl 6 another look; I was using 5 up until about 2009, when I replaced it with Ruby. It'd be nice to see what's been done in that time.

Please note that Perl 6 has been renamed to Raku (https://raku.org using the #rakulang tag on social media).

Do you have a tutorial or article you recommend to read up on x-modifier?

My favourite regex site is Rexegg[1]. I thought it had died so I was using archive.org but it appears to be back. Unmatched, in my opinion.

Still, there's not much to the `x` modifier: you just need to know that it will ignore unescaped whitespace and comments (# blah blah), so you must use `\s` or `\t` etc. when you really want to match some whitespace. Otherwise you can do things like:

    "Hello, world!".split(/ [ \s , ]+ # this splits on at least one space or comma /x)
Which is overkill for a simple pattern (then again, why not?) but unbelievably useful for a complex one. Add spaces and newlines until it's readable.

It's a small change but a big effect.

[1] https://www.rexegg.com/regex-modifiers.html

Implementation may vary, check your regex library’s docs (usually listed under “flags”). Generally there’s not much to it; specify the flag and then whitespace and some form of comment within the string is ignored.
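For example, Python spells the `x` flag `re.VERBOSE` (or inline `(?x)`); unescaped whitespace and `#` comments are ignored outside character classes:

```python
import re

# Same split-on-spaces-or-commas example, annotated with a comment.
splitter = re.compile(r"""
    [\s,]+    # one or more whitespace characters or commas
""", re.VERBOSE)

print(splitter.split("Hello, world!"))  # ['Hello', 'world!']
```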

One approach I like for this sort of thing is Pratt parsing. A full Pratt parsing "library" can be implemented with barely any work (hour or two), and from there you get quick and easy expression parsing with full support for precedence levels, left/right associativity, infix operations, etc.

Doc I read for introduction to the area: http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-e...

Quick little TS implementation: https://github.com/JacksonKearl/PrattParse

Above used in a cute little D&D dice rolling discord app (basically implementing a language with random variables as literals & first-class values): https://github.com/JacksonKearl/RollDice-Discord/
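The "barely any work" claim is easy to demonstrate. A minimal Pratt parser really is just a loop driven by binding powers (this sketch is hypothetical, not the linked TS library; token names and powers are made up):

```python
import re

def tokenize(src):
    return re.findall(r"\d+|[-+*/^()]", src)

# Binding powers: higher binds tighter.
BP = {"+": 10, "-": 10, "*": 20, "/": 20, "^": 30}

def parse(tokens, min_bp=0):
    tok = tokens.pop(0)
    if tok == "(":
        lhs = parse(tokens, 0)
        tokens.pop(0)                     # consume ")"
    elif tok == "-":                      # prefix minus
        lhs = ("neg", parse(tokens, 25))
    else:
        lhs = int(tok)
    while tokens and tokens[0] in BP and BP[tokens[0]] > min_bp:
        op = tokens.pop(0)
        # right-associative `^`: recurse with bp-1 so equal powers bind right
        rhs = parse(tokens, BP[op] - 1 if op == "^" else BP[op])
        lhs = (op, lhs, rhs)
    return lhs

print(parse(tokenize("1 + 2 * 3")))  # ('+', 1, ('*', 2, 3))
```

Precedence comes from the power table, and associativity from whether the recursive call uses the operator's own power or one less.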

I found this article[1] to be a great introduction to Pratt parsing as well; it even ties into how it's actually the shunting-yard algorithm in disguise[2].



I actually love parser-combinators.

Something about writing them feels very 'correct' and they're often the first I reach for when I've got to parse something and there isn't a pre-existing library.

It's one of the tasks Haskell feels especially well suited for: the function composition tools and the syntax make reading and writing them especially straightforward. Writing parser combinators in other languages just feels clunky.

They are honestly fun to write, but then you're doing recursive descent. Having written a significant one, I felt the elegance of the concept was marred by all the workarounds creeping into the code.

Parser combinators allow one to take the large problem of parsing a language, decompose it into smaller, solvable problems of applying grammar rules, and then compose the solutions together via combinators. This is a simple-yet-powerful approach that I wish were more prevalent in other areas of software engineering. Most tools that we know and love (e.g. regex) aren't composable enough.
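The core idea fits in a few lines even without Haskell. In this sketch (hypothetical, for illustration) a parser is a function from input text to a (result, remaining-input) pair, or None on failure:

```python
def char(c):
    """Parser for a single literal character."""
    def p(s):
        return (c, s[1:]) if s.startswith(c) else None
    return p

def seq(*parsers):
    """Run parsers in order, collecting their results (like concatenation)."""
    def p(s):
        out = []
        for q in parsers:
            r = q(s)
            if r is None:
                return None
            v, s = r
            out.append(v)
        return (out, s)
    return p

def alt(*parsers):
    """Try alternatives in order (like |)."""
    def p(s):
        for q in parsers:
            r = q(s)
            if r is not None:
                return r
        return None
    return p

def many(q):
    """Zero or more repetitions (like Kleene star)."""
    def p(s):
        out = []
        while (r := q(s)) is not None:
            v, s = r
            out.append(v)
        return (out, s)
    return p

digit = alt(*(char(d) for d in "0123456789"))
number = many(digit)
print(number("42abc"))  # (['4', '2'], 'abc')
```

Each combinator returns an ordinary parser, so big grammars are built by composing small ones, which is exactly the decomposition the comment above describes.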

The readability is something you can get with a simple builder over regexes. Emacs has rx, as an example.

Getting back a parse tree, though, is a huge step. I've always been surprised that there isn't a BNF language that lets you write the grammar and get back an instance of it from a string. Yacc, I suppose, is what I'm asking for. Feels off, though.

The ABC notation reminds me of the SMX notation I used on the IBM PC BASICA back in the 1980s. https://en.wikipedia.org/wiki/Music_Macro_Language#SMX . I used to have a lot of fun encoding simple songs that way - and debugging my songs.

I haven't used it since then. A quick comparison now shows the two formats have different heritages. Still, the flashback was nice.

This has been my experience, too. We needed to parse a complex DSL in Lisp, and I was hoping we didn't have to use regex for all the reasons you mentioned. The parser combinator library we used (MaxPC) brought with it many advantages: readability and small individual parsers were at the top of the list.

There was a learning curve in switching to thinking in parser combinators, but once we had tried every single construct in MaxPC and done everything wrong that was possible(!), we learned how to really make the system fly. Now I feel comfortable that we could write a parser for anything with these ideas.

And your music notation parser sounds cool.

It could be clearer if you used Raku regexes:

    my $pattern = /:sigspace
      score  $<points>     = (\d+)
      for    $<nights>     = (\d+)  nights?
      at     $<hotel-name> = (.*)
    /;

    if 'score 400 for 2 nights at Minas Tirith Airport' ~~ $pattern {
      say "the score was $<points>";
      say "it occurred over $<nights>";
      say "at $<hotel-name>";
    }
Or if you really wanted to be ambitious, a grammar

    grammar Pattern {
      token points {\d+}
      rule  nights {<(\d+)> nights?}
      token hotel-name {.*}
      rule TOP {
        score <points> for <nights> at <hotel-name>
      }
    }

    if Pattern.parse('score 400 for 2 nights at Minas Tirith Airport') {
      say "the score was $<points>";
      say "it occurred over $<nights>";
      say "at $<hotel-name>";
    }
Note that only the `s` is made optional by the `?` in `nights?`. To make the whole thing optional would require grouping it in some manner: `[nights]?` or even `"nights"?`.

Scheme irregex addresses a lot of these problems. It has become my go-to tool for regular expressions in scheme.


I really like playing around with combinatorial parsers. I feel like it's almost a natural development in any software engineer's life:

* Very Early Days: Basic string manipulation.

* Early Days: Regular expressions; perhaps including misguided efforts to parse HTML/e-mail addresses, only abandoned due to what seems like never ending complexity, rather than realising it's not possible.

* Not Long After: Overusing/abusing regular expressions; perhaps with the understanding that they can't solve all parsing problems, but will get you most/some of the way there; you know about parsers, but they feel too mystical ("something that a compiler does").

* Pre-Enlightenment: Writing very simple, recursive descent parsers; perhaps reading up/playing with things like Yacc, Bison, ANTLR, etc.

* Enlightenment: Abstracting your earlier recursive descent parsers with combinator primitives that maybe mimic the quantifiers used in regular expressions. Now you can parse (almost?) anything! (n.b., This step often only happens if you've first delved into functional programming.)

* Post-Enlightenment: Overusing/abusing parser combinators -- DSL all the things!! -- but all the while, perhaps realising their limits/complexity class.

I'm being somewhat facetious, rather than giving myself a backhanded compliment, but you get the idea :) The more powerful tools you discover, the more you want to use them until you inevitably outgrow them. Besides realising that "lesser" tools have their place, I'm curious what happens next...

> * Post-Enlightenment: Overusing/abusing parser combinators -- DSL all the things!! -- but all the while, perhaps realising their limits/complexity class.

Have you looked at the Rosie Pattern Language? I think it's quite an interesting attempt at a more "high level" modern pattern language. It's more human-readable and tries to make adding tests as easy as possible (given how monstrous regexps can get, I think the importance of that shouldn't be underestimated)


I had a great experience using PetitParser (https://github.com/petitparser) in Dart, though it turned out my use case was a bit hairy for combinatorial parsing: I made a parser for org-mode (https://orgmode.org/) markup, which is similar to Markdown.

Markdown-like formats are quite line-oriented and sometimes context-sensitive in ways that make rules hard to express.

I did get something usable, but I had to make some new primitives and break some efficiency rules (copying substrings).

The result is here: https://github.com/amake/org_parser

(Shameless plug) It’s used in my FOSS mobile org-mode viewer app Orgro: https://orgro.org

> In order to compose these regexes together, we have to break them apart first, and then apply some string concatenation

I have been fighting with that for parsing sentences which contain locations, and yeah, there is no easy solution. My final approach is a combination of:

- string variable concatenation

- regex multi-line mode

- comments inside the regex
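Those three techniques combine naturally in Python's `re` module. A sketch (the `CITY`/`ZIP` fragments and the sample sentence are made up for illustration):

```python
import re

# Fragments kept in string variables so they can be reused and composed.
CITY = r"(?P<city>[A-Z][a-z]+)"
ZIP = r"(?P<zip>\d{5})"

# Concatenate the fragments into one verbose, commented, multi-line pattern.
LOCATION = re.compile(
    rf"""
    {CITY} ,? \s*   # city name, optional comma
    {ZIP}           # postal code
    """,
    re.VERBOSE | re.MULTILINE,
)

m = LOCATION.search("Visit Portland, 97201 next week")
print(m.group("city"), m.group("zip"))  # Portland 97201
```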

I discovered parser combinators a few years ago in Scala with parboiled[1] and fell in love with them. Although I haven't used them lately, I still have to look into Fastparse[2], and I'm sure I will have fun going through this article on parser combinators in Rust[3].

[1] https://github.com/sirthias/parboiled2

[2] https://github.com/lihaoyi/fastparse

[3] https://bodil.lol/parser-combinators/

And the next level would be to create a grammar for what one wants to parse, which would make it more declarative and even better.

The problem with parser combinators is not as big as with regular expressions, but they still have a problem once the format changes: you might need to change your code, possibly a lot. Using grammars, you change only the declarative grammar and the program stays the same. Also, a grammar can be used from other programming languages as well, if they support the same grammar syntax (have a parser for that syntax, ha!).

Parser combinators are fantastic tools, and in many languages there isn't a need to roll your own like the author did. I've enjoyed the following in Python and C#: https://github.com/sighingnow/parsec.py https://www.nuget.org/packages/LanguageExt.Parsec/3.4.15

Elixir has the NimbleParsec library which generates parsing code that takes advantage of pattern matching for high performance and low memory usage due to the BEAM virtual machine optimisations.


I've had a good experience with this approach recently for a project where I wanted to extract the components of a recipe ingredient: https://github.com/JedS6391/RecipeIngredientParser

Brilliant. Having not thought about the subject deeply, it seems this article is an intellectual rendering of my otherwise subconscious frustration.

Now we just need a better solution.
