PCRE has built-in support for this kind of factoring, too:
(?(DEFINE)
  (?<code>
    [A-Z]*H                      # prefix
    \d+                          # digits
    [a-z]*                       # suffix
  )
  (?<multicode>
    (?: \( \s* )?                # maybe open paren and maybe space
    (?&code)                     # one code
    (?: \s* \+ \s* (?&code) )*   # maybe followed by other codes, plus-separated
    (?: \s* [\):+] )?            # maybe space and maybe close paren or colon or plus
  )
)
( (?&multicode) )                # code (capture)
( .*? )                          # message (capture): everything ...
(?=                              # ... up to (but excluding) ...
    (?&multicode)                # ... the next code
    (?! [^\w\s] )                # (but not when followed by punctuation)
  | $                            # ... or the end
)
If the regular expression engine accepted tree structures instead of just strings, you could have first-class definitions of regular-expression fragments. Even better, you could define them as functions, giving you parameterized fragments. Then you could just apply something like http://edicl.github.io/cl-ppcre/#create-scanner2 to the resulting expression tree without having to use the bizarre definition syntax above.
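You can approximate parameterized fragments in plain Python with functions that return pattern strings; a minimal sketch (the helper names here are mine, not a standard API):

    import re

    def group(pattern):
        # Wrap a fragment in a non-capturing group so it composes safely.
        return "(?:" + pattern + ")"

    def code():
        return group(r"[A-Z]*H\d+[a-z]*")

    def multicode(sep=r"\s*\+\s*"):
        # A parameterized fragment: the separator between codes is an argument.
        return group(code() + group(sep + code()) + "*")

    pattern = re.compile(multicode())
    print(pattern.fullmatch("AH1 + BH2"))  # matches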
I'm working on this right now for sed :). It currently works with both GNU and BSD-style sed, either for BREs or EREs. I'm going to make it easier to install sometime soon and then hopefully expand the project to other languages.
And because the PCRE library is integrated into a huge number of languages (it's almost hard to find a language that doesn't have it; I'm looking at you, JavaScript), these kinds of regexes are actually widely available.
`(?N)` where `N` is group number and `(?&name)` where `name` is named group are known as subexpression calls. The third-party `regex` module (https://pypi.org/project/regex/) supports this and more such PCRE features.
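For instance, a classic balanced-parentheses matcher with a subexpression call (the pattern itself is my illustration, not from the thread):

    import regex  # pip install regex; the stdlib re module lacks (?&name)

    parens = regex.compile(r"(?P<parens>\((?:[^()]+|(?&parens))*\))")
    print(parens.fullmatch("(a(b)(c))"))  # matches: the group calls itself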
The real kicker, hidden between everything, is that you can combine f-strings and r-strings.
fr”this is both an f- and an r-string”
I had no idea. I wish Python allowed for custom string types. I would love a SQL string type, if for nothing else than to show my code editor how to highlight inside the string!
That’s not true, though! It’s written for humans to read, not for machines to parse, and any human reading this will realize what they’re supposed to be.
I think he means that in the code you'll want to use " rather than ”. The ” character often ends up appearing when copy-pasting code from websites or apps like Word, and it often causes trouble because it isn't parsed as a quote character by compilers/interpreters.
Code is very often written to be copied and pasted, though! And you'd be surprised by the number of people who don't notice the difference between the quotes.
Those people will notice when their compiler complains and will hopefully know better than to copy+paste something from an untrusted website next time.
This code is very obviously just illustrating a point to the people reading it; it seems unlikely that anyone would want to copy and paste it. Lots of code snippets are incorrect code. For example, I often write C code like this:
int x = ...;
The line contains a syntax error, but I’m communicating to the people reading that x is initialized to some value.
Oh ho ho ho… this is good. Anything that breaks apart regexes to make them easier to read and comment each logical unit is worth the extra lines and syntax. This is killer.
It looks like this particular one has had some staying power, because it (or the PCRE version mentioned elsewhere in the comments) has been rather widely implemented.
I didn't know about this and was a happy regular expression user without it, but this looks like a good feature for the specific use case of wanting other people to understand the structure of your regular expressions. And much more portable than I would have expected.
A regex is just a fast, universally understood way to do some simple parsing, usually integrated into the language. Parser combinators are an amazing specialized tool for building parsers, but one that is generally harder to integrate into a codebase (outside of, e.g., Haskell), requires lesser-known libraries, and is a paradigm that people need time to get used to.
It's like saying people shouldn't use a mitre box [1] and should instead use a full-fledged circular mitre saw [2]. Yes, the second tool is much more versatile, powerful, and useful. But it requires much more setup, skill, and investment to actually use.
So is SNOBAL. I don't disagree that long regexes are difficult to read, only that real programmers don't really value that compared to conciseness (in practice, anyway; people always claim to value readability, but their actions often differ).
I queried Google for '"SNOBAL" acronym' and am no wiser than before; what do you mean? Also, I like to think of myself as a "real" programmer, and I take great care to produce readable code :-)
I do a similar thing to what the article suggests, except using Python's "concat between parentheses" strings instead of Python's heredoc strings. The advantage of doing it this way is that there are no caveats (as mentioned in the article) about needing to unexpectedly escape certain characters.
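A minimal sketch of that style (the pattern is made up for illustration): Python concatenates adjacent string literals at compile time, so each fragment gets its own line and an ordinary comment:

    import re

    pattern = re.compile(
        r"[A-Z]*H"  # prefix
        r"\d+"      # digits
        r"[a-z]*"   # suffix
    )
    print(pattern.fullmatch("AH123xy"))  # matches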
You might already be aware of this, but 'pegs' likely refers to Parsing Expression Grammars [1], a super powerful and imho very chill concept which translates into great tooling in lots of languages.
Never used them, but don't they use memory proportional to the length of the string being parsed? Regexes, LL, and LR parsers don't have this shortcoming. That surely constrains their applications.
I decided to go looking for when Perl got verbose regex. The oldest thing I can find on perldoc.perl.org or cpan.org is 5.004 (1997), where they were an existing feature.
EDIT: Found 4.036 sources (1993). A quick scan of the man page (troff source!) does not find verbose regular expressions. So it looks like they were introduced very early in the Perl 5 series.
I can't find it now, but I think there was a Larry Wall quip along the lines of, "of all the concepts to borrow from Perl, why did Python take regex?"
It looks similar, but the semantics are quite different. A Raku grammar is a recursive descent parser, but this is still a regular expression in the end.
I don't want to be that guy, but why in the world are f-strings (formatted string literals) called literals? They are clearly dynamically calculated expressions.
There are two definitions of “literal” in widespread use.
The first definition is as you say: an expression that has a constant value.
The second definition is: an expression that is the primary syntactic form to construct a type. For example: “array literals” construct arrays, but may contain arbitrary expressions within.
The first definition is more common in low-level languages where there is a place in the compiled executable to put constant data. These languages might call the second form an initializer rather than a literal. But in a dynamic language such as Python the distinction is less important.
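The two senses side by side, in Python:

    n = 42            # sense 1: a constant literal
    xs = [n + 1, n]   # sense 2: a list "literal" containing arbitrary expressions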
Lisp uses the first definition. And it's as dynamic as they come. Java also uses the first definition.
I suspect a more proper description would be: #1 is the correct definition, formally used for 80 years now. #2 is incorrect, and is being abused by people in JS and Python who should know better.
They're a weird combo of compile-time parsing and run-time expressions, with a custom opcode. Early on they didn't have expressions but worked like .format(). Then they added expressions, and the PEP title needed differentiation from it. The not-entirely-accurate title is now set in stone.
I would have called them "interpolated strings" or even e-string but the f-string moniker had already caught on and there was no stopping it.
99 times out of 100, when I think I might need a regular expression, I find it far better to code the search in Python directly rather than using the regular expression engine at all. It's far easier to understand, and you can run a regular debugger on it and use regular comments. In the 100th case, I'll code most of the expression in straight Python and a very small piece using the regular expression engine.
By straight python I mean things like 'for', 'split', 'startswith', 'find', and regular character indexing.
So for me this post is a solution to a problem that I just avoid.
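A hypothetical sketch of that approach (parse_line and the input format are mine, not the commenter's): pull a "CODE: message" line apart with string methods alone:

    def parse_line(line):
        # Split on the first colon; string methods stand in for a regex.
        head, sep, message = line.partition(":")
        if sep and head.strip().isalnum():
            return head.strip(), message.strip()
        return None

    print(parse_line("AH123: pump failure"))  # ('AH123', 'pump failure')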
Programmers adding features to languages the same way horror movie characters decide on entering an abandoned shack in the woods.
No. f-strings are an awful idea. And combining them with r-strings is yet another awful idea. Please, never do that.
I'm also not a big fan of extended / verbose regular expressions, because the feature creates ambiguity in interpretation (a slightly different language for defining regular expressions). It's a bad solution to the problem of building longer expressions, which should've been addressed in a different way: by making the language of regular expressions more modular, not by allowing more hard-to-interpret language details.
“Here's the plan: When someone uses a feature you don't understand, simply shoot them. This is easier than learning something new, and before too long the only living coders will be writing in an easily understood, tiny subset of Python 0.9.6 <wink>.”
Can’t named pattern groups do the same thing the f-strings do here? (As a bonus, named patterns work in old Python 2 code, which yes, nobody should be using anymore, but just sayin’.) I don’t have an opinion or even a good mental model of the advantages of either choice, but I was pretty excited to learn about named patterns and promptly used them to make a hacky (but interesting to me) small parser for identifying tokens and keywords and operators with differing precedence.
This is a game changer. I did not know f-strings could be used like this; I was largely satisfied with no longer having to use the awful 2.7-era percent formatting.
Even more reason to love Python. Right now, slots and dataclasses are my new obsession. There was a great article posted here that went into detail about 3.8 and up and featured all these great Python hacks.
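For anyone curious, the combination looks like this (slots=True needs Python 3.10+; the Point class is just an illustration):

    from dataclasses import dataclass

    @dataclass(slots=True)
    class Point:
        x: float
        y: float

    p = Point(1.0, 2.0)  # instances get __slots__, so no per-instance __dict__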
This chart alone should be telling. Ruby has a steeper learning curve, a smaller set of libraries, and a stable but slow rate of innovation. The returns aren't that great compared to something like Python, whose developers can be trained on most tools in its ecosystem.
I agree with your point. To throw a fly in the ointment and play devil's advocate: how do we know the question load for RoR isn't just decreasing because there aren't many questions left to answer?
I started writing a reply to your comment in an attempt to dissuade you from such heretical propaganda but then I decided to just hotlink this: https://imgs.xkcd.com/comics/duty_calls.png
An f-string evaluates to a string, not to an object such as a compiled regex. For this there are tagged template literals in JavaScript (which got them from E). Example: https://github.com/erights/quasiParserGenerator
So you either call re.compile on it or, as in the example in the article, you call one of the re module's functions that take a pattern string as an argument.
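A minimal sketch tying the two points together (the names are illustrative): the fr-string below evaluates to an ordinary str, which then goes through re.compile like any other pattern:

    import re

    code = r"[A-Z]*H\d+[a-z]*"
    pattern = re.compile(fr"({code})(?:\+({code}))*")
    print(pattern.fullmatch("AH1+BH2"))  # matches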
Parsing strings assembled out of strings is classically bug-prone. In the alternative I'm pointing out, you don't fill the holes with other strings, you fill them with already-parsed regex objects. I think it's a shame Python didn't follow this design, which predated f-strings.
> Parsing strings assembled out of strings is classically bug-prone.
I think the bigger problem is not so much that regexes are generally defined by strings specifying the desired regex, but that at some point between the development of printf and the development of regex libraries people forgot that they could use whatever escape character they wanted when implementing a new conceptual data type. In C, the compiler deals with strings, and the printf function deals with strings, and they try not to conflict with each other by assigning the escape character \ to the compiler while printf uses % instead.
But in Java, and Python, and presumably many, many other languages, some idiot decided that if strings used \ for their escape character, regex functions should also use \ for their escape character. Since they accept strings, suddenly the regex escape character is actually "\\". How do you match a single literal backslash? "\\\\", obviously. What's wrong with that?
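The complaint made concrete, in Python:

    import re

    # Matching one literal backslash: the regex needs \\, and in a
    # conventional string each of those needs escaping again.
    print(re.search("\\\\", "a\\b"))  # four backslashes in the source
    print(re.search(r"\\", "a\\b"))   # a raw string halves the pain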
I meant the design of f-strings, not of regexes. I must've written my comment very badly since almost all of these replies seem to take this misinterpretation.
(E did have this concept back in '97, iirc, though yeah I wasn't expecting Guido to have run into it then, that wasn't when I meant.)
I think the issue is that f-strings have nothing to do with regexes specifically, or parsing generally. They're a formatting API. So the complaint that you wish the formatting API made regexes a first-class citizen is odd and suggests a misunderstanding somewhere.
An f-string fills holes in a format string to build a string.
A template literal parses a template with holes to produce any datatype you like, filling it with arguments of any appropriate type.
This does for Javascript what f-strings do for Python (with similar syntax and simplicity), but also more. Besides the greater expressiveness, it can catch bugs, in the same way that Lisp macros are safer than C macros. It's not hardwiring regexes, it's not hardwiring any datatype: it calls a function that you name in the tag, which you can define to do whatever parsing and filling is proper.
Lisp's quasiquotation is similar in spirit though different in appearance.
This is my last try to explain here. I guess this thread shows that tagged template strings are much less well known on HN than I thought. (Yes it also shows me I was unusually bad at communicating.)
> the complaint that you wish the formatting API made regexes a first class citizen is odd
That was not what I was trying to say. The template literal mechanism knows nothing about regexes. Regexes are just one particular type and one particular syntax.
I must have a serious bug in my writing about this (sorry), because this was never about regex engines -- it's about literals and domain-specific sublanguages in general. Composing DSL programs by string concatenation is such a famous source of security bugs you see it in top-10 lists. I linked to the very similar example of a PEG-parsing DSL.
Since you explicitly talked about filling the fields in an f string with already parsed regex objects instead of strings, it's hard to see what else you could mean. But even if I s/regex engine/DSL parsing engine in general/, I would like to see an actual example of a language or library where I can have a string like a Python f-string whose fields can be filled with some kind of parsed "engine" object instead of another string.
> Composing DSL programs by string concatenation is such a famous source of security bugs you see it in top-10 lists.
I don't see how composing DSL programs by filling in string fields with parsed "engine" objects is much better. I personally don't like regexes in general because I find them too hard to reason about unless they're extremely simple (and regexes that simple usually aren't necessary). I would rather try to write library functions (which might include functions that build other functions) in the same language as the rest of my program.
Yeah, sorry I didn't explain tagged template literals, just linked to an example.
You could make one called, say, rx for regex. Then
rx`${a}|${b}`
would evaluate to exactly the same result as a tree constructor call like
regex_or(a, b)
given corresponding definitions of rx and of regex_or. There's never any question of whether a and b are escaped right. So it brings the composability advantage of the library functions you prefer, to people who want to write these concrete-syntax regular expressions that started the whole thread.