Super-expressive – Write regex in natural language (github.com/francisrstokes)
83 points by jack_riminton on Jan 21, 2021 | 56 comments



For me, it's like writing 10824 as "ten thousand eight hundred and twenty-four".

I'd rather read a real regex using the verbose flag to comment groups:

- it shows the real regex for quick scanning and for those familiar with the syntax

- it explains things to people who are not familiar with it, or when the regex is complicated

- it forces the writer to divide the regex into logical groups

- such a system would need comments anyway, to indicate what you are matching, such as "product code, date, color", for each part of the matching code.

- if you can't write or read regex and your job is programming, spending an afternoon learning them should be your next step. They are everywhere, no matter the tech stack: unix tools, IDE, 3rd party libs...

- there are plenty of regex tester UIs, which let me copy the regex, test various cases to see what it does, and tweak it

But I can see the value of such a lib for learning regexes.


I'll disagree here. While I have quibbles about the specific API used in this project, I like the idea of an imperative regex builder, especially if it can be type checked.

Every time I turn to regex, I waste time debugging which characters I forgot to escape or accidentally escaped when all the brackets and slashes blur together. I debug why some group isn't matching right because regex's semantic density makes it hard to tell where the group starts and ends. I turn to regex debuggers because they're necessary, but they're not great experiences, and at first glance I'd think a type checked regex builder could make debuggers unnecessary a lot of the time.

There's also a discoverability problem. I know non-capturing groups and negative lookbehind are a thing, but I always have to look them up because it's hard to remember the arcane syntax if I don't use them often. And my peers don't even know some of those things exist, so they struggle to solve easy problems. A library that my editor would offer autocomplete suggestions for would really help this.

I also think a regex builder would promote better organization - break the regex into parts, assign the parts to variables, and reuse portions of a regex. That's all possible with traditional regex, but I don't see folks doing it because it seems few folks know about verbose regex, building logic with string concatenation is discouraged in many situations, and if your language has a native regex data type, declining that in favor of string building feels weird. If all a regex builder did was reframe what developers feel is natural to do, that would be beneficial.
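For instance, a minimal sketch in plain JS (no library; the names are illustrative, and the pattern is the hex-word example from the project's readme):

  // Build the pieces separately, name them, and compose with .source.
  const hexDigit = /[A-Fa-f0-9]/;
  const hexWord  = new RegExp(`(${hexDigit.source}{4})`); // four hex digits, captured
  const hexValue = new RegExp(`^(?:0x)?${hexWord.source}$`);
  hexValue.test("0xC0D3"); // -> true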


Here is what I see.

You want an automated tool to build a regular expression for people who don't understand regular expressions. There is no shortage of ways this is going to lead to disasters, starting with the fact that far too many developers do not understand the difference between pattern matching and parsing, and will reach for the wrong tool with no idea what it is doing or why they can't get it to do what they want.

See https://stackoverflow.com/questions/1732348/regex-match-open... for more.


I'd like to clarify that my position isn't about "not understanding regular expressions". Sure, it could help people who don't. But even for people like me: I read the O'Reilly regex pocket book and other materials, I studied formal regular languages in college, I built a basic lexer/parser for a senior project, I've written no shortage of simple and complicated regexes in application code in the workplace, and once, when a conference speaker quizzed the audience on which commonplace file format a gnarly full-page regex matched, I was the first to shout out the answer.

I'll never be up there with Brian Kernighan[0], but I know my way around regex at least as well as it's reasonable to expect from the average developer.

My position is that even with a background in regexes that's a lot deeper than just Googling and putzing on Regex101.com, traditional regex syntax is still a frustrating time sink that's hard to get correct without more trial-and-error than feels intrinsically necessary. The syntax provides zero opportunity to discover there's a more effective way to perform a task. I have trouble identifying a compelling value proposition for traditional syntax besides familiarity and natural serializability, and the fact that it gets the job done at all.

I don't believe traditional regex syntax is the optimal way to accomplish text pattern matching tasks in the workplace, and I'm open to other tooling that makes success simpler and more reliable. People misuse regexes all the time (examples like the one you linked are almost tropes at this point), but I don't think that's compelling justification on its own for preserving the status quo.

[0] https://www.cs.princeton.edu/courses/archive/spr09/cos333/be...


My only complaint about the regex syntax is that it does not allow you to separate things out with whitespace, or add comments about what the chunked units mean. The x modifier fixes both.

What you traditionally see with a complex RE for a complex pattern is the same as what you traditionally see with someone writing complex SQL statements on a single line. Stop trying to treat it like a black box, and treat it as a programming language in its own right. Use whitespace, indentation, and comments (when necessary) to communicate intent as well as just to make it do its job.

Other than that, regular expressions say what they mean and mean what they say very concisely and directly, particularly the PCRE variants of the language. I consider that conciseness and directness a virtue.


For me, a regex usually has to be wrapped into a function, which I can then throw copious amounts of unit tests at. Regex is, IMHO, an easy-to-write, hard-to-read language, so I find it more fruitful to use tests to specify the task being accomplished, so that - if it's easier - I can just rewrite the appropriate regex from scratch rather than trying to decipher how the old one is broken.

If the task is complex enough, regex might not even be the right tool for the job, and the function boundary provides a sensible encapsulation boundary.
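A minimal sketch of what I mean (names are illustrative; the regex is the readme example):

  // The function boundary hides the regex; the tests pin down the task,
  // so the regex can be rewritten from scratch if it ever breaks.
  function parseHexWord(s) {
    const m = /^(?:0x)?([A-Fa-f0-9]{4})$/.exec(s);
    return m ? m[1] : null;
  }

  console.assert(parseHexWord("0xC0D3") === "C0D3");
  console.assert(parseHexWord("C0D3") === "C0D3");
  console.assert(parseHexWord("0xC0D") === null);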


> if you can't write or read regex and your job is programming, spending an afternoon learning them should be your next step

I’m one of those super stubbornly bullheaded people who believes I can learn or do anything if I’m simply willing to devote the time/energy to it so I don’t say this lightly:

I am absolutely incapable of “reading” RegEx.

I use RegEx. I (conceptually) understand RegEx. I can write Regex quickly and effectively without much thought. But I can’t read it to save my life. In fact it’s so difficult for me that I struggle to believe there are people who actually can “read” it.

I can decipher it but it’ll take me a bit - more like solving a little puzzle in my head than reading and understanding a piece of code.

And judging by the way a lot of people talk about RegEx (including seasoned programmers) I can’t imagine I’m alone.


Does any representation deserve to be singled out as the "real regex"? I'd have picked the abstract syntax tree. A bunch of constructor calls are closer to that.

(Yes, the Perl regex syntax has some advantages, and I'm only objecting to what you're calling it. Though it's also true that a tree structure has advantages over a string: for instance, you don't have to parse it to manipulate it.)


I think you are one hundred percent correct :)

But I can see it being useful if the idea is translated to NLP -> regex

A GPT-3-to-regex translator would be awesome


I could see it being awesome from a "This is cool" perspective, but I wouldn't trust it to actually work.

Just imagine it getting a negative swapped or something.


I think it would be useful to generate a regex, which would then get written down in the code to ensure it doesn't change. You could test it, ensure it works, then just use the output of the neural net...


Good point, that does actually sound quite useful.


Tools like this keep appearing on HN, and I shudder every time I see them -- not because I don't like the idea, but because of the constraints involved in adopting it.

It would be one thing if this were a tool for generating a regex, which the developer then copies and pastes into the source code. But it’s expected that this syntax is checked in and maintained in the source, with all the requisite constraints and dependencies. It’s unportable to other languages, unreadable without the docs, and non-standard. It's also another runtime dependency that will need periodic security updates applied (I hope :-D). Meanwhile, there is a standard, built-in cross-language syntax for expressing these things (and great tools like https://regexr.com/ for reference).

Maybe this makes sense in one-off scripts, but I don't see it adding value where the code is intended to be a source of truth.


This is largely my opinion too, although with every non-trivial regex (especially with look-aheads and such) that I have seen committed, nobody knows what the hell it's supposed to do unless the original author commented the hell out of it, or is the one working on it (or can be consulted). In that respect something like this seems like a nice improvement since it's basically documentation!

It could be horribly misused though, so documentation would still be important.

In many respects it is to regexes what an ORM is to a database. There are probably pros and cons to each.


They link an online GUI playground for generating expressions: https://sepg.netlify.app/

For me this is pretty cool. I don't have to write much regex, mostly copypasta from Stack Overflow, but when I do need something more unique, regex is very hard/confusing for me to learn, so this type of JS property chaining makes sense in my mind.

Though when testing their GUI using the first example in their readme, I can't get it to highlight the accepted text.


The very first example ends with:

  // Produces the following regular expression:
  /^(?:0x)?([A-Fa-f0-9]{4})$/
It would certainly be helpful to include the input to this API as a code comment along with a link to the playground, but you don't need the library at runtime to do so, and there's very little reason to add it as a dependency.
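Something like this (a sketch of that convention, using the readme example):

  // Generated with SuperExpressive (playground: https://sepg.netlify.app/):
  //   SuperExpressive().startOfInput.optional.string('0x')
  //     .capture.exactly(4).anyOf
  //       .range('A', 'F').range('a', 'f').range('0', '9')
  //     .end().end()
  //     .endOfInput.toRegex()
  const hexWord = /^(?:0x)?([A-Fa-f0-9]{4})$/;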


Sometimes people have to use things that are non-standard so that they can become standard. Besides, converting to and from standard regexes shouldn't be hard. Yes, there are disadvantages to using a non-standard third-party library, but maybe there are advantages too.


> It would be one thing if this were a tool for generating a regex, which the developer then copies and pastes into the source code. But ...

I'm inclined to agree. I guess the Playground (https://sepg.netlify.app/) together with the documentation is almost that.

Not sure if that will get you better/faster results than just learning regex though. Maybe it could be a good way to figure out some of the trickier regex stuff.


Printable, readable abstract syntax for regex is a thing.

  This is the TXR Lisp interactive listener of TXR 248.
  Quit with :quit or Ctrl-D on an empty line. Ctrl-X ? for cheatsheet.
  TXR contains many small parts, unsuitable for children under 12 months.
  1> (regex-source #/a.*[\s\d:]\d+bcd/)
  (compound #\a (0+ wild) (set :space :digit
                            #\:)
   (1+ :digit) #\b #\c #\d)
  2> *1
  (compound #\a (0+ wild) (set :space :digit
                            #\:)
   (1+ :digit) #\b #\c #\d)
  3> (regex-compile *1) ;; compile source back to regex
  #/a.*[\s\d:]\d+bcd/
  4> (typeof *1)
  cons
  5> (typeof *3)
  regex
(Insert Racket example and others here).


> (Insert Racket example and others here).

Common Lisp (via CL-PPCRE):

  CL-USER> (parse-string "a(?i)b(?-i)c")

  (:SEQUENCE #\a
   (:SEQUENCE (:FLAGS :CASE-INSENSITIVE-P)
    (:SEQUENCE #\b (:SEQUENCE (:FLAGS :CASE-SENSITIVE-P) #\c))))
You can use both notations to compile and execute regexes. Also, bonus points for any Lisp that does it: you don't have to monkey around with ".end()" and such to turn a "fluent syntax" - which is inherently linear - into something resembling a tree, because s-expressions are trees and you get that for free.


The .end() occurs because the approach taken is procedural construction via a "linear" path. And that's because that maps nicely to the chained.function().syntax(). That gets ugly if the terms have multiple arguments.

   (S (NP (N DOG))
      (VP (V KICKS))
      (NP (N MAN)))
becomes:

   (S DOG.N().NP()
      KICKS.V().VP()
      MAN.N().NP())
so far not bad, but now we deal with S, which becomes a method of DOG.N().NP().

   DOG.N().NP().S(KICKS.V().VP(),
                  MAN.N().NP())
The main connective S is now buried in the middle. We start with DOG, a fourth-level leaf element, make an N out of it, then an NP, and now we start a sentence construction, where we bring in the other parts. The structure reveals the evaluation order, not the actual structure.

BTW, the TXR Lisp version of this syntax is a tad more readable:

   DOG.(N).(NP).(S KICKS.(V).(VP)
                   MAN.(N).(NP))
Even when we have obj.fun(arg) as a given, we should at least move the parenthesis before the function: obj.(fun arg).


I think this is great for maintainability. I saw a blog post earlier advocating for using long-form arguments in scripts: "--help" instead of "-h".

The same applies here: "/\d/" might be as recognisable to regex-savvy developers as "ls -l" is to bash-savvy ones, but for other maintainers who are not as familiar, this makes the code way easier to maintain.


There's also VerbalExpressions library that's been ported to 30+ languages: https://github.com/VerbalExpressions/JSVerbalExpressions


I’ve often wondered why there isn't a tool that generates regex for you automatically when you feed it a number of similar strings. Is this too complicated mathematically (even for machine-learning), or does such a thing actually exist (and I just don’t know about it)?

Imagine you want to write a regex that captures hex values such as 0xC0D3. You enter a few sample values (the more you enter, the more concise the regex which the generator will spit out) and the generator should easily discover that all your values start with 0x followed by exactly four digits 0-9 or a-f, and give you an expression (albeit maybe not the best one, or maybe several). Bonus points if it indeed explains in natural language each part of the regex it generated...

I’d imagine this would also be a great tool for learning regex, a sort-of “learning by reverse-engineering” if you will.

I seem to need regex only once every two years or so, which is not enough to learn it properly or, if I did, to retain it. So such a tool would be awesome.


Emacs has something similar, for generating compressed matchers from examples you provide. Though this operates on a closed-world principle: the result will match your examples, but only your examples.

  ELISP> (message (regexp-opt '("0x12" "0x13" "0x2f" "0x1d" "0xcc")))

  "\\(?:0x\\(?:1[23d]\\|2f\\|cc\\)\\)"
  (in *Messages* buffer)
  \(?:0x\(?:1[23d]\|2f\|cc\)\)
(Note that Emacs' regex syntax is slightly different than PCRE.)


A DFA-based regex engine will spit out an optimal state machine if you give it a regex which just combines all your inputs with the disjunction:

   0xC0D3|0xC0FD|...|0xBEEF
For example, the NFA-to-DFA subset construction algorithm will implicitly figure out that every branch starts with 0x, and so the initial state will have only a single transition out of it on the character 0, and the next state on the character x.

If we feed it every 16-bit hex string from 0x0000 to 0xFFFF, it should reduce to just 7 states:

  S0 -[0]-> S1 -[x]-> S2 -[0123456789ABCDEF]-> S3 ...
What you need is just a way to convert the compiled DFA back to a regex, which you could then take in place of the original alternation. The transitions on multiple characters have to be intelligently converted to readable classes like [0-9A-F].

Now suppose we feed it every 0xXXXX string except 0xFFFE. The states along the F,F,F path split off from the rest, and the last transition out of that path will not include the E character. In that state, if the next input character is E, the machine errors out.

I'm sure there are issues with this idea that have to be solved. It's a famous fact of DFA construction that certain patterns, like (a|b)*a(a|b){n}, lead to an exponential explosion of states.
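Short of a full DFA round trip, the compression idea can be sketched with a trie, similar in spirit to Emacs' regexp-opt from the sibling comment (plain JS; assumes the inputs contain no regex metacharacters):

  // Compress a list of literal strings into one regex by sharing prefixes.
  function trieRegex(words) {
    const trie = {};
    for (const w of words) {
      let node = trie;
      for (const ch of w) node = node[ch] ??= {};
      node.$ = true; // end-of-word marker
    }
    const emit = (node) => {
      const parts = Object.keys(node).filter(k => k !== "$")
                          .map(k => k + emit(node[k]));
      if (parts.length === 0) return "";
      if (node.$) return `(?:${parts.join("|")})?`; // a word may end here
      return parts.length === 1 ? parts[0] : `(?:${parts.join("|")})`;
    };
    return new RegExp(emit(trie));
  }

  trieRegex(["0x12", "0x13", "0x2f"]); // -> /0x(?:1(?:2|3)|2f)/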


Something like this, perhaps? http://regex.inginf.units.it/


Interesting, though it seems like it might take more time to create enough examples for useful results than to just learn regex (eg. complete https://regexone.com/).

> The quality of the solution depends on a number of factors, including size and syntactical properties of the learning information.

> The algorithms embedded in this experimental prototype have always been tested with at least 25 matches over at least 2 examples. It is very unlikely that a smaller number of matches allows obtaining an useful solution.


The search keyword you might want to use for this is "program synthesis".


I truly don't see the point.

The code that you wind up with is so long that it is hard to read for anyone, whether or not you know regular expressions. I'll never remember what all of the things are called. Anyone who knows regular expressions will find the regular expression readable. Anyone who doesn't is likely to find both relatively similar in effort to learn.

The only tool you need for maintainable regular expressions is the x modifier. That lets you break it up with whitespace and add comments. Here is a real example in code that I wrote for a tool used by people who don't know regular expressions. (This is Python that is parsing out the contents of various arrays in a bash script.)

        # This match will pull those out into an array of pairs
        # representing an array name and the inside of the parens:
        #
        #  [('sqlFiles', ' "foo"
        #              "bar"
        #              "baz" '), ...]
        #
        match_pairs = re.findall(
            """(?xs)    # x turns on verbose expressions (allowing these comments)
                        # s says . matches everything (including newlines)
             ( \w+ )    # Capture the name of the list
             \s* = \s*  # spaces = spaces
             \(         # find open paren
               (        # Capture it.
                 (?:    # nested non-capturing pattern
                  \s+   # whitespace
                    |   # or
                  ".*?" # " with as few characters as possible then "
                 )*     # non-capturing pattern repeats 0 or more times.
               )        # end capture
             \)         # Closing paren
            """, contents)
If you know regular expressions, the comments are superfluous, though breaking up the expression does make it easier to read. If you don't know regular expressions, you should be able to figure out what the code is trying to do and how it does it.

How you specify that modifier varies heavily by language. So, for example, in Perl you end your expression with /x. In Python you have to start your regular expression with (?x). In Postgres you pass a third argument with 'x' in it. Sadly, JavaScript does not support it. (Big mistake.)


> Sadly, JavaScript does not support it. (Big mistake.)

In languages that do not support the x modifier, I just break up my regex into substrings. Something like:

  auto regex = std::string("(\\w+)")  // Capture the name of the list
             + "\\s*=\\s*"            // optional whitespace around =
             + ...
             + "\\)";                 // Closing paren


You could perhaps use es6 templates for better readability
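Something like this (a sketch; String.raw keeps the backslashes literal, so each part reads like regex source):

  const name   = String.raw`(\w+)`;    // capture the name of the list
  const equals = String.raw`\s*=\s*`;  // optional whitespace around =
  const openP  = String.raw`\(`;       // literal open paren
  const regex  = new RegExp(name + equals + openP);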


That's a useful technique. Thanks.


It has the advantage of using normal syntax rather than string literals (AST manipulation, formatting, static analysis, autocomplete, ...).


Regular expressions are their own language. The fact that we encode a DSL as a string is the problem for tools, and not the fact that strings are hard to handle.

If you offer a syntax extension other than strings to indicate the DSL coming next, and then have syntax highlighting for that DSL, then it would be fine.


Here is (I think) the example regex ported to OCaml's Re library [1]

    let my_regex =
      let open Re in
      seq [
          bos;
          opt (str "0x");
          repn (
              alt [
                  rg 'A' 'F';
                  rg 'a' 'f';
                  rg '0' '9';
              ]
          ) 4 (Some 4) |> group;
          eos;
      ]
      |> compile
I'm familiar with standard (compact) regex syntax, but I've been using the above syntax recently in a couple of small places. I'm a bit on the fence as to which is "better". The compact syntax is, of course, more compact. I think the comparison is very similar to APL (which I've not used) versus most other common programming languages.

One advantage of the expanded syntax is that it's a bit nicer to incorporate a string variable, e.g. "str some_string" vs. "/#{Regexp.escape(some_string)}/" (to borrow Ruby's syntax).

[1] https://github.com/ocaml/ocaml-re


I'd want something like this and the more compact language.

Classic regex syntax has a similar problem as purely expression-based languages: it's hard to clearly delineate between smaller and larger scale structures.

Sometimes, brevity is clarity. It's bad to write matchDigit().times(2).match('-').matchDigit().times(2) instead of simply '\d\d-\d\d'. So even with the problems with metacharacters, I don't think you want to lose that.

But more complex regexes are clearest when the individual parts are assigned to meaningful names and they're then composed into the final expression.

Most regex implementations require string-munging to compose regular expressions, and developers and maintainers must be aware of what that string-munging means to the regex compiler.

Allowing the dev to compose regular expressions from parts, then, seems like the greatest opportunity here for improving regex syntax. It'd be especially helpful for any dynamically generated regular expression, and you could have facilities like a "quote" operator.
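A sketch of such a "quote" operator in plain JS (the escape list follows the common escapeRegExp recipe; names are illustrative):

  // Escape metacharacters so arbitrary text can be embedded literally.
  const quote = s => s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");

  const version = quote("(v1.0)"); // -> "\\(v1\\.0\\)"
  const line = new RegExp(`^release ${version}$`);
  line.test("release (v1.0)"); // -> true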


For those that don't see the point of this yet, it's like a database query builder (or ORM) but for constructing regular expressions. The best implementation that I know of is Eloquent for Laravel:

https://laravel.com/docs/master/eloquent

The idea is that the idiosyncrasies of a DSL like SQL can be abstracted away by providing a functional interface that encodes the bugs/features of the grammar. The simplest example might be how SQL requires the WHERE clause to come before ORDER BY. But in Eloquent, clauses can generally be attached in any order. The clauses are stored as an abstract query object until they're executed, but the raw SQL can be retrieved at any time with the toSql() method, which is similar to the toRegex() method from the article:

https://laravel.com/api/master/Illuminate/Database/Query/Bui...

You can also write your own methods to return clauses. Which allows you to write general queries like "select all of the articles from this user" and then append a clause like "limit to N results starting from X" when it comes time for a controller to run the query and return results to the user. I've found that composability is what makes query builders so exponentially more powerful than raw SQL.

Keep in mind that where Eloquent, this library, and most others fall down is that they provide no way to go from a raw query (SQL, regex, whatever) back to a query object. Everyone is so busy writing software for writing software that they forget most of the everyday workload is reading someone else's lackluster code.

I don't really know why I wrote all of this, but since I do most of my thinking in query builders now, I thought it might be a useful pattern for others to know.


There's an rx syntax for regexes in Emacs [0] that, as far as I know, derives from the SRE expression syntax [1] of Olin Shivers. In Common Lisp, CL-PPCRE [2] offers an s-expression syntax for Perl-compatible regexes that is even more efficient.

[0] https://www.gnu.org/software/emacs/manual/html_node/elisp/Rx... [1] https://scsh.net/docu/post/sre.html [2] https://edicl.github.io/cl-ppcre/


"Natural language" being

  .startOfInput
  .optional.string('0x')
  .capture
    .exactly(4).anyOf
      .range('A', 'F')
      .range('a', 'f')
      .range('0', '9')
    .end()
  .end()
  .endOfInput
...ok, that's slightly more readable than a regex, because it "unpacks" the arcane syntax, but you still have to be familiar with the workings of regular expressions to understand it. And it also has the disadvantage that you are replacing the widely understood regex syntax with a niche "DSL" - sort of like using an exotic framework for a popular programming language.


Common Lisp has had something like this for ages in the CL-PPCRE library, with the added benefit of not having to monkey with ".end()" and whatnot, because s-expression syntax guarantees you always know what is enclosed by what.

Still, I came to a similar conclusion as you. It's sometimes nice to check a descriptive format if needed (CL-PPCRE can take a "standard" regex and translate it to its tree-based verbose notation), but it's a chore to type that up, and it still requires you to understand how the grammar works. It's much easier to work with the standard notation.

And I mean - what's the big problem here? Regexes are so common and so useful that, in my opinion, any self-respecting developer should learn them to the point they feel comfortable with reading and writing them, with a manual on hand. If you want to make it easier to read, just split it up and comment it, be it with 'x' modifier or with string concatenation and your programming language's native comments.


  SuperExpressive()
    .anythingButString('aeiou')
    .toRegex();
  // ->
  /(?:[^a][^e][^i][^o][^u])/
There's got to be a better way to express that in Regex, right?


Hmmm.... 'beiou' isn't 'aeiou', but it isn't matched by that pattern. It seems like you need /(?:[^a]....|.[^e]...|..[^i]..|...[^o].|....[^u])/.


Wow, what a horribly named function for a library that aims to make regular expressions more readable. The name very strongly suggests it should match anything but the string aeiou, so that it would match for instance the string a, but that is not what it does at all.


Not without look-ahead/look-behind
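For instance, a sketch with negative lookahead, matching any five characters that are not exactly the string "aeiou":

  /(?!aeiou)[\s\S]{5}/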


"Now you have three problems."

Domain-specific notation (provided it's good) is a barrier to entry for the same reason it's powerful: it removes the ambiguity and verbosity intrinsic to natural languages.


My number one tool in helping me write or debug regexes is this visualizer: https://regexper.com/


Interesting. This is pretty much parser combinator syntax but restricted to regexes. DX-wise, I'd just prefer a parser combinator library.

In practice this probably compiles to a regex for performance reasons: a JS parser is probably much slower than hooking JS to a C regex library.

In a faster (compiled) language you don't have this limit; e.g. Haskell and Rust have several popular parser combinator libraries.
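For comparison, the combinator style looks roughly like this (a toy sketch; these helpers are hypothetical, not from any real library):

  // A parser takes an input string and returns [result, rest] or null.
  const char = c => s => s[0] === c ? [c, s.slice(1)] : null;
  const seq = (...ps) => s => {
    const out = [];
    for (const p of ps) {
      const r = p(s);
      if (!r) return null;
      out.push(r[0]);
      s = r[1];
    }
    return [out, s];
  };

  const hexPrefix = seq(char('0'), char('x'));
  hexPrefix("0x12"); // -> [['0', 'x'], "12"]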


Far be it from me to shame anyone who finds this useful.

Personally speaking, I would find it frustrating if I had to support such code.


Great, now I have 3 problems.


How does this compare with parser combinators?


This is just an alternative syntax for regular expressions. Parsers can also handle non-regular inputs (context-free grammars).


Or, you know, just write regex.


Pff real pros write everything in machine code


Indeed they do, https://github.com/concurrencykit/ck

Go back to your js library and maybe one day you'll get to write some Wasm.


Pff Wasm, pure binary too hard for you? amateur



