The unreasonable effectiveness of f-strings and re.verbose (andgravity.com)
304 points by genericlemon24 34 days ago | hide | past | favorite | 90 comments

PCRE has builtin support for this kind of factoring, too:

    (?(DEFINE)
        (?<code>
            [A-Z]*H  # prefix
            \d+      # digits
            [a-z]*   # suffix
        )
        (?<multicode>
            (?: \( \s* )?               # maybe open paren and maybe space
            (?&code)                    # one code
            (?: \s* \+ \s* (?&code) )*  # maybe followed by other codes, plus-separated
            (?: \s* [\):+] )?           # maybe space and maybe close paren or colon or plus
        )
    )
    ( (?&multicode) )           # code (capture)
    ( .*? )                     # message (capture): everything ...
    (?=                         # ... up to (but excluding) ...
        (?&multicode)           # ... the next code
            (?! [^\w\s] )       # (but not when followed by punctuation)
        | $                     # ... or the end
    )

If the regular expression engine accepted tree structures instead of just strings, you could have first class definitions of fragments of regular expressions. Even better, you could define them as functions, so you could have parameterized fragments. So then you could just apply something like http://edicl.github.io/cl-ppcre/#create-scanner2 on the resulting expression tree without having to use the bizarre definition syntax above.

I'm working on this right now for sed :). It currently works with both GNU and BSD-style sed, either for BREs or EREs. I'm going to make it easier to install sometime soon and then hopefully expand the project to other languages.


And because the PCRE library is integrated into a huge number of languages (it's almost hard to find a language that doesn't have it - I'm looking at you, JavaScript), these types of regexes are actually widely available.

`(?N)` where `N` is group number and `(?&name)` where `name` is named group are known as subexpression calls. The third-party `regex` module (https://pypi.org/project/regex/) supports this and more such PCRE features.
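A minimal sketch of a subexpression call using the third-party `regex` module mentioned above (the pattern and test strings here are illustrative):

```python
import regex  # third-party: pip install regex

# (?&word) re-invokes the pattern of the named group <word>.
pattern = regex.compile(r"(?P<word>[A-Za-z]+) (?&word)")
print(bool(pattern.fullmatch("hello world")))  # True
print(bool(pattern.fullmatch("hello 123")))    # False
```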

I mention PCRE in passing at the end of the article, but I didn't know about (?(DEFINE)...); that's very, very cool!

Afair, so does onigmo/oniguruma, with a mildly different syntax

The real kicker, hidden between everything, is that you can combine f-strings and r-strings.

    fr”this is both an f- and an r-string”
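With straight ASCII quotes, a minimal runnable sketch of the combination:

```python
import re

digits = r"\d+"           # raw string: the backslash survives
pattern = fr"^{digits}$"  # fr"": raw AND interpolated at once
print(pattern)                         # ^\d+$
print(bool(re.match(pattern, "123")))  # True
```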

I had no idea. I wish Python allowed for custom string types. I would love a sql string type if for nothin else than to show my code editor how to highlight inside the string!

You _technically_ can create custom string types ... they just aren't very easy to integrate with other people's code: https://pypi.org/project/future-fstrings/

I've used an extension that does this in the past to great effect. I can't seem to find it right now, but here's a similar one: https://marketplace.visualstudio.com/items?itemName=qufiwefe...

Those quotes should be "ASCII" quotes, not Unicode.

Yes they should. I’m pretty proud I managed to keep the f lowercase, given that I typed this on an iPhone …

That’s not true, though! It’s written for humans to read, not for machines to parse, and any human reading this will realize what they’re supposed to be.

Human here. I don’t know what they are supposed to be - seriously. Could you explain please?

I think he means that in code you'll want to use " rather than ”. The latter often ends up appearing when copy-pasting code from websites or apps like Word, and it often causes trouble because it is not parsed as a quote character by compilers/interpreters.

Code is very often written to be copied and pasted though! And you could be surprised by the number of people who don't notice the difference between the quotes.

Those people will notice when their compiler complains and will hopefully know better than to copy+paste something from an untrusted website next time.

This code is very obviously just illustrating a point to the people reading it, it seems unlikely that anyone would want to copy and paste it. Lots of code snippets are incorrect code. For example, I often write C code like this:

    int x = ...;
The line contains a syntax error, but I’m communicating to the people reading that x is initialized to some value.

Ok then - the left one's the wrong way around.

If you understand what the quotes are supposed to be, what is the problem?

Oh ho ho ho… this is good. Anything that breaks apart regexes to make them easier to read and comment each logical unit is worth the extra lines and syntax. This is killer.

There have been many attempts at doing this over the decades. It never catches on. I don't think it's something developers really want.

It looks like this particular one has had some staying power, because it (or the PCRE version mentioned elsewhere in the comments) has been rather widely implemented.

I didn't know about this and was a happy regular expression user without it, but this looks like a good feature for the specific use case of wanting other people to understand the structure of your regular expressions. And much more portable than I would have expected.

Strongly disagree. I’ve seen this pattern used heavily inside many ruby codebases, it’s incredibly useful for making regular expressions readable(ish)

I’ve been doing this for years. Splitting up regexes and reusing sub-patterns is very common in new code.

I suggest you take a look at parser combinators. They are quite readable compared to regexes

They are totally different solutions.

A regex is just a fast, usually integrated into the language, universally understood way to do some simple parsing. Parser combinators are an amazing specialized tool for building parsers, but one that is generally harder to integrate into a code-base (outside of e.g. Haskell), requires lesser known libraries, and is a paradigm that people need time to get used to.

It's like saying people shouldn't use a mitre-box [1] and instead use a full fledged circular mitre-saw [2]. Yes the second tool is much more versatile, powerful and useful. But it requires much more setup, skill and investment to actually use.

[1] https://en.wikipedia.org/wiki/Mitre_box [2] https://www.toolstation.com/power-tools/mitre-saws/c317

So is SNOBAL. I don't disagree that long regexes are difficult to read, only that real programmers don't really value that compared to conciseness (in practice anyway; people always claim to value readability, but their actions often differ).

I queried google for '"SNOBAL" acronym' and am no wiser than before, what do you mean? Also, I like to think of myself as a "real" programmer and I take great care to produce readable code :-)

Parent comment probably meant SNOBOL.

Thank you. For the uninitiated: https://en.wikipedia.org/wiki/SNOBOL see the second paragraph.

I am a developer and I really want it. For the next long regex I am going to use it for sure, as I like to use f-strings extensively already.

They might not see the value when they are writing a new regex, but they’ll miss it when they are reading an existing one.

I do a similar thing as suggested by the article, except by using Python's "concat between parenthesis" strings instead of Python's heredoc strings. The advantage of doing it this way is that there are no caveats (as mentioned in the article) with needing to unexpectedly escape certain characters.

It looks like this:

    pattern = (
      r'[A-Z]*H'  # prefix
      r'\d+'      # digits
      r"[a-z']*"  # suffix
    )
No funky stuff with escapes, and you can indent to your heart's content.
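Compiled and exercised, the adjacent-literals version might look like this (the test string is illustrative):

```python
import re

# Adjacent string literals are concatenated at compile time,
# so the fragments below form one pattern string.
pattern = re.compile(
    r'[A-Z]*H'   # prefix
    r'\d+'       # digits
    r"[a-z']*"   # suffix
)
print(bool(pattern.fullmatch("ABH42x'")))  # True
```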

This reminds me quite a lot of the "pegs" module in Nim.


    identifier <- [A-Za-z][A-Za-z0-9_]*
    charsetchar <- "\\" . / [^\]]
    charset <- "[" "^"? (charsetchar ("-" charsetchar)?)+ "]"
That's a small snippet of how it is used. It's been one of my favourite parts of using Nim, to be honest.

The fact one can get similar ergonomics this way in straight Python is wonderful! I'm definitely going to leverage this.

I've done similar in other languages, but it's never felt quite right. re.VERBOSE is also handy to know.

You might already be aware of this, but 'pegs' likely refers to Parsing Expression Grammars [1], a super powerful and imho very chill concept which translates into great tooling in lots of languages.

1 https://en.wikipedia.org/wiki/Parsing_expression_grammar

You surmised right, though I adore Nim's particular implementation of it compared to the times I've attempted it in, say, Javascript :)

Parsing expression grammars are easily one of my favourite tools. Honestly, I find them superior to regexes.

Never used them, but don't they have memory usage proportional to the length of the string being parsed? Regexes, LL, and LR don't have this shortcoming. This should surely constrain their applications.

Do sources still exist for Perl 4 and earlier?

I decided to go looking for when Perl got verbose regex. The oldest thing I can find on perldoc.perl.org or cpan.org is 5.004 (1997), where they were an existing feature.

EDIT: Found 4.036 sources (1993). A quick scan of the man page (troff source!) does not find verbose regular expressions. So it looks like they were introduced very early in the Perl 5 series.


This looks like it's approaching the logical neighborhood where parser combinators live. I'm a fan of parser combinators.

It's pretty much exactly what Perl 6 / Raku grammars are: https://docs.raku.org/language/grammars

I can't find it now, but I think there was a Larry Wall quip along the lines of, "of all the concepts to borrow from Perl, why did Python take regex?"

It looks similar, but the semantics are quite different. A Raku grammar is a recursive descent parser, but this is still a regular expression in the end.

I don't want to be that guy, but why in the world are f-strings (formatted string literals) called literals? They are clearly dynamically calculated expressions.

There are two definitions of “literal” in widespread use.

The first definition is as you say: an expression that has a constant value.

The second definition is: an expression that is the primary syntactic form to construct a type. For example: “array literals” construct arrays, but may contain arbitrary expressions within.

The first definition is more common in low-level languages where there is a place in the compiled executable to put constant data. These languages might call the second form an initializer rather than a literal. But in a dynamic language such as Python the distinction is less important.
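For instance, under the second definition a Python list "literal" may contain arbitrary runtime expressions:

```python
import math

x = 4
# A "list literal" in the second sense: the syntax is literal,
# but the element values are computed at runtime.
values = [x, x * 2, math.sqrt(x)]
print(values)  # [4, 8, 2.0]
```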

Lisp uses the first definition. And it's as dynamic as they come. Java also uses the first definition.

I suspect a more proper description would be: #1 is the correct definition, formally used for 80 years now. #2 is incorrect, and is being abused by people in JS and Python who should know better.

They're a weird combo of compile-time parsing and run-time expressions, with a custom opcode. Early on they didn't have expressions but worked like .format(). Then they added expressions and the PEP title needed differentiation from it. The not-entirely-accurate title is now set in stone.

I would have called them "interpolated strings" or even e-string but the f-string moniker had already caught on and there was no stopping it.

99 times out of 100 when I think I might need a regular expression, I find it far better to code the search in Python directly rather than using the regular expression engine at all. It's far easier to understand, and you can run a regular debugger on it and use regular comments. In the 100th case, I'll code most of the expression in straight Python and a very small piece using the regular expression engine.

By straight python I mean things like 'for', 'split', 'startswith', 'find', and regular character indexing.

So for me this post is a solution to a problem that I just avoid.
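A sketch of that "straight Python" style on a made-up log line:

```python
# No regex engine -- just string methods and indexing.
line = "ERROR 404: not found"
head, _, message = line.partition(": ")
code = head.split()[1] if head.startswith("ERROR") else None
print(code, "-", message)  # 404 - not found
```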

“Unreasonable effectiveness” in titles considered harmful.

Programmers adding features to languages the same way horror movie characters decide on entering an abandoned shack in the woods.

No. f-strings are an awful idea. And combining them with r-strings is yet another awful idea. Please, never do that.

I'm also not a big fan of extended / verbose regular expressions, because they create ambiguity in interpretation (a slightly different language to define regular expressions). It's a bad solution to the problem of building longer expressions, which should've been addressed in a different way: by making the language of regular expressions more modular, not by allowing more hard-to-interpret language details.

> No. f-strings are an awful idea. And combining them with r-strings is yet another awful idea.


“Here's the plan: When someone uses a feature you don't understand, simply shoot them. This is easier than learning something new, and before too long the only living coders will be writing in an easily understood, tiny subset of Python 0.9.6 <wink>.”

― Tim Peters

If you have a number of curly braces in the pattern, probably easier to use printf-style formatting to build the pattern, with %s etc.

Now you have two problems. Or in this case, probably more than two.

An infinity! https://xkcd.com/1313/

Okay wait a minute, what's that about the subtitles? Isn't that too small of a regex to accurately classify all the subtitles?

According to this site[0] it's about the names of the movies, not about the subtitles of all the dialog in the movies.

For example the title is "Star Wars", the subtitle is "The Empire Strikes Back"

[0] https://www.explainxkcd.com/wiki/index.php/1313:_Regex_Golf

Ahh OK, that makes much more sense, thanks!

Can’t named pattern groups do the same thing the f-strings do here? (As a bonus named patterns work in old Python 2 code, which yes nobody should be using anymore, but just sayin’) I don’t have an opinion or even good mental model about the advantages of either choice, but I was pretty excited to learn about named patterns and promptly used it to make a hacky (but interesting to me) small parser for identifying tokens and keywords and operators with differing precedence.

> As a bonus named patterns work in old Python 2 code

So does string formatting. You don't need f-strings for this pattern to work.
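For instance, old-style %-formatting (which also works in Python 2) composes a pattern from fragments without f-strings; the fragments and test string here are illustrative:

```python
import re

code = r"[A-Z]+\d+"
# %(code)s pulls the fragment from the dict -- no f-strings needed.
pattern = "(?P<code>%(code)s): (?P<msg>.*)" % {"code": code}
m = re.match(pattern, "AB12: disk full")
print(m.group("code"), "/", m.group("msg"))  # AB12 / disk full
```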

This is a game changer. I did not know f-strings could be used like this; I was largely satisfied with not having to use the awful 2.7-era percentage symbols.

Even more reason to love Python. Right now slots and dataclasses are my new obsession. There was a great article posted here that went into detail about 3.8 and up and featured all these great Python hacks.

Good god. Your mind would be absolutely blown by what you can do in Ruby.

What can you do in Ruby?

Was that snark necessary? This article is about Python. I have no interest in Ruby.

I read it as good natured humor - clearly dramatic. Just sharing my opinion, fwiw, I see how you might interpret it differently.

Having an interest in other languages is important.

Aha, I think you mean Rails; there can't be anyone left using it for anything outside of that.

Web development has gotten much more complex these days, and while RoR is still strong, Python has gone "mainstream".


This chart alone should be telling. Ruby has more of a learning curve, a smaller set of libraries, and a stable but slow rate of innovation. The returns aren't that great compared to Python, whose developers can be trained on most tools in its ecosystem.

I agree with your point. To throw a fly in the ointment and play devil's advocate: how do we know the question load for RoR isn't just decreasing because there aren't many questions left to answer?

The majority of Amazon’s infrastructure as code is in Ruby ;)

I started writing a reply to your comment in an attempt to dissuade you from such heretical propaganda but then I decided to just hotlink this: https://imgs.xkcd.com/comics/duty_calls.png

An f-string evaluates to a string and not to an object such as a compiled regex. For this there are tagged template literals in Javascript (which got them from E). Example: https://github.com/erights/quasiParserGenerator

So you either call re.compile on it or, as in the example in the article, you call one of the re module's functions that takes a pattern string as an argument.

Parsing strings assembled out of strings is classically bug-prone. In the alternative I'm pointing out, you don't fill the holes with other strings, you fill them with already-parsed regex objects. I think it's a shame Python didn't follow this design, which predated f-strings.

> Parsing strings assembled out of strings is classically bug-prone.

I think the bigger problem is not so much that regexes are generally defined by strings specifying the desired regex, but that at some point between the development of printf and the development of regex libraries people forgot that they could use whatever escape character they wanted when implementing a new conceptual data type. In C, the compiler deals with strings, and the printf function deals with strings, and they try not to conflict with each other by assigning the escape character \ to the compiler while printf uses % instead.

But in Java, and Python, and presumably many, many other languages, some idiot decided that if strings used \ for their escape character, regex functions should also use \ for their escape character. Since they accept strings, suddenly the regex escape character is actually "\\". How do you match a single literal backslash? "\\\\", obviously. What's wrong with that?
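Concretely, in Python:

```python
import re

s = "a\\b"  # the three-character string: a, backslash, b
# Plain string literal: four backslashes reach the regex engine as two,
# which the engine reads as one escaped literal backslash.
print(re.search("\\\\", s) is not None)  # True
# Raw string literal: two backslashes suffice.
print(re.search(r"\\", s) is not None)   # True
```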

> I think it's a shame Python didn't follow this design

I mean, Python's regex module was added in 1997 and hasn't fundamentally changed since. I don't think that concept was super common in '97.

I meant the design of f-strings, not of regexes. I must've written my comment very badly since almost all of these replies seem to take this misinterpretation.

(E did have this concept back in '97, iirc, though yeah I wasn't expecting Guido to have run into it then, that wasn't when I meant.)

I think the issue is that f-strings have nothing to do with regexes specifically, or parsing generally. They're a formatting API. So the complaint that you wish the formatting API made regexes a first-class citizen is odd and suggests a misunderstanding somewhere.

Evidently I was still unclear:

An f-string fills holes in a format string to build a string.

A template literal parses a template with holes to produce any datatype you like, filling it with arguments of any appropriate type.

This does for Javascript what f-strings do for Python (with similar syntax and simplicity), but also more. Besides the greater expressiveness, it can catch bugs, in the same way that Lisp macros are safer than C macros. It's not hardwiring regexes, it's not hardwiring any datatype: it calls a function that you name in the tag, which you can define to do whatever parsing and filling is proper.

Lisp's quasiquotation is similar in spirit though different in appearance.

This is my last try to explain here. I guess this thread shows that tagged template strings are much less well known on HN than I thought. (Yes it also shows me I was unusually bad at communicating.)

> the complaint that you wish the formatting API made regexes a first class citizen is odd

That was not what I was trying to say. The template literal mechanism knows nothing about regexes. Regexes are just one particular type and one particular syntax.

Can you give an example of a regex engine that has the design you describe?

I must have a serious bug in my writing about this (sorry), because this was never about regex engines -- it's about literals and domain-specific sublanguages in general. Composing DSL programs by string concatenation is such a famous source of security bugs you see it in top-10 lists. I linked to the very similar example of a PEG-parsing DSL.

But any regex engine that can work with a parse tree shows the same principle, e.g. https://edicl.github.io/cl-ppcre/#create-scanner2

> this was never about regex engines

Since you explicitly talked about filling the fields in an f string with already parsed regex objects instead of strings, it's hard to see what else you could mean. But even if I s/regex engine/DSL parsing engine in general/, I would like to see an actual example of a language or library where I can have a string like a Python f-string whose fields can be filled with some kind of parsed "engine" object instead of another string.

> Composing DSL programs by string concatenation is such a famous source of security bugs you see it in top-10 lists.

I don't see how composing DSL programs by filling in string fields with parsed "engine" objects is much better. I personally don't like regexes in general because I find them too hard to reason about unless they're extremely simple (and regexes that simple usually aren't necessary). I would rather try to write library functions (which might include functions that build other functions) in the same language as the rest of my program.

Yeah, sorry I didn't explain tagged template literals, just linked to an example.

You could make one called, say, rx for regex. Then

    rx`${a}|${b}`

would evaluate to exactly the same result as a tree constructor call like

    regex_or(a, b)
given corresponding definitions of rx and of regex_or. There's never any question of whether a and b are escaped right. So it brings the composability advantage of the library functions you prefer, to people who want to write these concrete-syntax regular expressions that started the whole thread.
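A hedged Python sketch of the `regex_or` half of that idea (the function name and fragments are illustrative): composing compiled pattern objects instead of raw strings sidesteps the escaping question:

```python
import re

def regex_or(*parts: re.Pattern) -> re.Pattern:
    # Wrap each sub-pattern in a non-capturing group before joining,
    # so alternation precedence can't leak between fragments.
    return re.compile("|".join(f"(?:{p.pattern})" for p in parts))

a = re.compile(r"\d+")
b = re.compile(r"[a-z]+")
either = regex_or(a, b)
print(bool(either.fullmatch("123")), bool(either.fullmatch("abc")))  # True True
```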

Python doesn't have regex literals.

Right. A tagged template literal can denote a regex object, not just a string.

This looks like perlre /x.

I mean it's definitely better, but if you end up with a regex that long you really shouldn't be using regex.

This seems like a weird, and possibly untrustworthy, hack. Does Python not have the equivalent of Ruby's `x` modifier?

Yep, python has the same option, it's called re.VERBOSE and aliased as re.X.
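A minimal sketch of the flag in use:

```python
import re

# Under re.VERBOSE (alias re.X), unescaped whitespace and # comments
# in the pattern are ignored, so the pattern can be laid out freely.
pat = re.compile(
    r"""
    [A-Z]*H   # prefix
    \d+       # digits
    """,
    re.VERBOSE,
)
print(bool(pat.fullmatch("ABH42")))  # True
```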


What is untrustworthy about it?
