
Is there a regular expression to detect a valid regular expression? - lelf
https://stackoverflow.com/questions/172303/is-there-a-regular-expression-to-detect-a-valid-regular-expression
======
tobinfricke
> Is there a regular expression to detect a valid regular expression?

No, there is not. For example, parentheses in a regex must be balanced, and
(famously) there is no regex to detect balanced parentheses.

~~~
tgv
It depends on the representation. If the representation is in the FSA format
(state, character, state), it becomes nearly trivial.

~~~
morelisp
An analogous problem for this representation would be detecting transitions to
undefined states, or ambiguous (duplicate) transitions.
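
A minimal Python sketch of that check, assuming the automaton arrives as a
list of (state, character, state) triples plus a set of declared states. The
function name and the message format are made up for illustration.

```python
def check_transitions(transitions, states):
    """Return a list of problems found in (state, char, state) triples."""
    problems = []
    seen = set()  # (source state, character) pairs already defined
    for src, ch, dst in transitions:
        # Both endpoints of a transition must be declared states.
        for state in (src, dst):
            if state not in states:
                problems.append(f"undefined state: {state}")
        # A second transition for the same (state, char) is a duplicate.
        if (src, ch) in seen:
            problems.append(f"duplicate transition: {src} on {ch!r}")
        seen.add((src, ch))
    return problems


print(check_transitions([("q0", "a", "q1"), ("q0", "a", "q2")], {"q0", "q1"}))
```

Both checks need to compare one triple against the rest of the input, which
is the part that goes beyond matching the shape of a single triple.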

~~~
tgv
I didn't verify it, but my intuition says that can't be done with a CFG; it
seems at least context-sensitive.

------
noobiemcfoob
"Evaluate it in a try..catch or whatever your language provides."

"That's not very enterprisey of you"

Oooh, I'm laughing so hard it hurts. It's been a particularly 'enterprisey'
week at work.

~~~
braythwayt
That answer feels entirely wrong to me. As pointed out, it may be an XY
answer: It presumes the person posing the question wants to validate regular
expressions.

But with something so blatantly self-referential, it actually feels unlikely
to me that what they want to do is validate regular expressions. My guess is
that they are generally curious about whether Regexen (PCRE or strict regular
expressions) are powerful enough to validate Regexen (again whether PCRE or
strict).

XY answers are good for avoiding a lot of unnecessary yak-shaving/accidental
complexity of a bad solution. But the conversation around whether we are
talking about recognizing strict Regexen or PCRE Regexen, and in turn whether
we are using strict Regexen or PCRE Regexen to recognize them is not
accidental complexity or yak-shaving, it is intrinsic to understanding the
nature of the problem and solution spaces.

I too find the answer humorous for the "enterprisey" reference, but I think it
would be a very bad answer if we are judging it strictly on the basis of its
value.

------
gamache
I believe Zalgo has the answer to this, via an equivalent question.
[https://stackoverflow.com/questions/1732348/regex-match-open...](https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)

~~~
abraCadabstrax
Perhaps the most epic reply I have ever seen on SO.

~~~
hk__2
It’s a famous post on StackOverflow, but I don’t find it particularly helpful.

~~~
joppy
I don't think that answer was written with the intent of being particularly
helpful, I think it was written with a different goal in mind.

~~~
jacobush
It _is_ helpful in that almost koan way though.

------
fortran77
We see good examples of "the problem with StackOverflow" here.

The second highest-rated answer is "Evaluate it in a try..catch or whatever
your language provides." and it's justified because "Surely the real question
is 'how do I validate a regular expression'."

This is a fascinating computer science question and I'm pretty sure the
questioner wasn't asking "how do I validate a regular expression" because he
would have asked that.

~~~
thereare5lights
> I'm pretty sure the questioner wasn't asking "how do I validate a regular
> expression" because he would have asked that.

In my experience, people sometimes ask for how to solve the more immediate
detail they’re working on rather than the broader problem.

~~~
corobo
This came up recently. Going to start calling it the XYZ problem

[https://news.ycombinator.com/item?id=20861806](https://news.ycombinator.com/item?id=20861806)

Edit: page refreshed and sure enough the sibling comment calls it out by name

Edit2: Called it the XYZ problem: [https://cohan.io/the-xyz-problem/](https://cohan.io/the-xyz-problem/)

------
GuB-42
An interesting question would be "which regex variant can be parsed by itself,
and how?".

Parsing simple regex with PCRE is cheating. If you are using PCRE, you should
be parsing PCRE.

The simplest case would be DOS style globs with just * and ?, and it can parse
itself: just use "*".
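
A small Python sketch of that claim, assuming a glob dialect where only `*`
and `?` are special and every other character is literal. The helper names
are hypothetical, and the matcher is built on `re` purely for brevity.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob that uses only * and ? into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # any run of characters, possibly empty
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def glob_match(pattern, text):
    """True if the whole of `text` matches the glob `pattern`."""
    return glob_to_regex(pattern).fullmatch(text) is not None


# In this dialect every string is a valid glob, so "*" accepts all of them.
print(all(glob_match("*", g) for g in ["*.txt", "a?c", "", "**??"]))  # prints True
```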

------
kyberias
I hate it how the accepted answer is incorrect and does not teach the
fundamental property of regular expressions.

~~~
unlinkr
It only works in practice, not in theory.

~~~
yifanl
It doesn't work in practice; PCRE does not validate all valid PCRE expressions.

------
z3c0
The answer is as horrifying as you'd expect, but impressive nonetheless.
Recursive regex is really fun and powerful, when it works.

I once made a recursive regex for matching a full name, complete with checking
for suffixes, prefixes, an undefined amount of middle names, hyphenated last
names, and Scotch-Irish names (Mc-, O'-, etc). Still simpler than this
monstrosity.

------
jraph
My first thought seeing this title: "Why, no. Next!"

My second thought: the voice of Linus Torvalds at DebConf 14 saying "Hum… No!
Hum… that was quick." [1]

Though that only applies to a recursively defined, infinite regular expression
language that strictly handles regular languages, as defined in university
computer science courses.

And then, "Why not just try and see if it breaks the provided regex parser
since you have one?", and it's actually one of the answers in the link…
awesome. I wonder if it has security implications though (are forged regexes
exploiting flawed regex parsers a thing?)

[1] [https://youtu.be/5PmHRSeA2c8?t=110](https://youtu.be/5PmHRSeA2c8?t=110)

~~~
gpm
> are forged regexes exploiting flawed regex parsers a thing?

Looks like yes, depending on the engine.

PCRE for instance has a long list of security vulnerabilities including some
with arbitrary code execution: [https://www.cvedetails.com/vulnerability-list.php?vendor_id=...](https://www.cvedetails.com/vulnerability-list.php?vendor_id=3265&product_id=0&version_id=0&page=1&hasexp=0&opdos=1&opec=0&opov=0&opcsrf=0&opgpriv=0&opsqli=0&opxss=0&opdirt=0&opmemc=0&ophttprs=0&opbyp=0&opfileinc=0&opginf=0&cvssscoremin=0&cvssscoremax=0&year=0&cweid=0&order=1&trc=38&sha=3be670e89a2d45bd0df8736dd00d2179b59636da)

------
andrewflnr
I've always been mildly amused by the fact that while regular expressions are
a context-free language, context-free grammars are a regular language.

------
tannhaeuser
Basic classical regular expression syntax without grouping parentheses is
trivially regular. With parentheses, the maximum nesting level must be bounded
to remain regular because "regexpes can't count".
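
To make the bounded case concrete, here is a hedged Python sketch that builds
an ordinary, non-recursive regex for parentheses nested at most `depth`
levels deep. The function name is made up, and the pattern only checks
parenthesis balance, not the rest of regex syntax.

```python
import re


def balanced_up_to(depth):
    """Regex source for balanced parentheses nested at most `depth` deep."""
    inner = r"[^()]*"  # depth 0: no parentheses at all
    for _ in range(depth):
        # One more level: groups whose contents match the previous level.
        inner = rf"[^()]*(?:\({inner}\)[^()]*)*"
    return inner


two_levels = re.compile(balanced_up_to(2))
print(bool(two_levels.fullmatch("a(b(c)d)e")))  # prints True
print(bool(two_levels.fullmatch("(((x)))")))    # prints False
```

Each extra level wraps the previous pattern once, so a fixed bound always
yields a finite pattern, while no single pattern of this kind covers every
depth.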

------
mkagenius
It may be noted that regular languages (as in compiler design) do not have
such recursive constructs.

------
kmstout
No. I wrote about this a while back
([https://reindeereffect.github.io/2018/06/24/regex/#the-limit...](https://reindeereffect.github.io/2018/06/24/regex/#the-limits-of-regular-expressions)).
From that post:

[M]atching parentheses requires our recognition device to remember how many
unmatched open parentheses there are. Since the only way for a DFA to remember
anything is to be in one of a set of states corresponding to it, and since the
unmatched open parentheses could easily outnumber the available states, we can
see that the fundamental limitation of a DFA is that it can store only a
finite amount of information (remember the ‘F’ in DFA?). This limitation
applies to any string matching task that involves recursive structures or
algebraic relationships between substrings. It is why “HTML and regex go
together like love, marriage, and ritual infanticide.”
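
For contrast, a minimal Python sketch of the memory a DFA lacks: a single
integer counter that can grow without bound. This is only an illustration of
the argument above, not code from the linked post.

```python
def balanced(text):
    """Check parentheses using an unbounded counter of open parens."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a close with no matching open
                return False
    return depth == 0      # every open paren was closed


print(balanced("(" * 1000 + ")" * 1000))  # prints True
```

A DFA would need a distinct state for every possible value of `depth`, and
there is no finite number of states that covers them all.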

------
ac42
IIRC, a regex can be implemented by an NFA, which can't recognize the balanced
parentheses that are part of regex syntax.

~~~
mikejb
Correct, though there are multiple interpretations of what a "regex" is. In
the theoretical sense you're correct, though some (most programmers, I'd say)
mean "PCRE" when they say regex. And PCREs are pretty powerful, see [1] for
example

[1]
[https://news.ycombinator.com/item?id=9748736](https://news.ycombinator.com/item?id=9748736)

------
rocqua
So, given the much-discussed limitations of regexps and the desire to parse
context-free grammars, my question is: why are we still using regular
expressions? Or rather, why isn't there something as easy to use as regular
expressions that _can_ process context-free grammars?

~~~
zokier
Something like Perl6 grammars[1], or maybe Rosie Pattern Language[2]? Of
course Perl6 regexes also go well beyond regular expressions, and I suspect
they could be used to match context-free grammars if pressed hard enough. Both
P6 grammars and RPL are based on parsing expression grammars, and there are
also tools/libraries for many other languages based on PEGs. But now you are
entering the scary realm of parsers and parser generators and all that jazz,
and can debate how easy to use they really are.

[1]
[https://docs.perl6.org/language/grammars](https://docs.perl6.org/language/grammars)

[2] [https://developer.ibm.com/open/projects/rosie-pattern-langua...](https://developer.ibm.com/open/projects/rosie-pattern-language/)

~~~
wahern
PEGs are amazing precisely because they're as easy to use as, if not easier
than, common regular expression syntax. PEGs are literally the same as regular
expressions except 1) alternations are _ordered_ , 2) zero-width assertions
are formalized, and 3) quantifiers match greedily. This is effectively the
same behavior as the Perl-compatible regular expressions with which most
people are familiar.
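
A toy Python sketch of ordered choice and greedy repetition, written as plain
parser functions that return the end position on success or `None` on
failure. All names here are hypothetical and not taken from LPeg, Rosie, or
any other PEG library.

```python
def lit(s):
    """Match the literal string s."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse


def seq(*parsers):
    """Match each parser in turn; fail if any of them fails."""
    def parse(text, pos):
        for p in parsers:
            pos = p(text, pos)
            if pos is None:
                return None
        return pos
    return parse


def choice(*parsers):
    """Ordered choice: commit to the first alternative that succeeds."""
    def parse(text, pos):
        for p in parsers:
            end = p(text, pos)
            if end is not None:
                return end
        return None
    return parse


def star(parser):
    """Greedy repetition: consume as many matches as possible."""
    def parse(text, pos):
        while True:
            end = parser(text, pos)
            if end is None or end == pos:
                return pos
            pos = end
    return parse


def balanced(text, pos):
    """Grammar rule: balanced <- ( "(" balanced ")" )*"""
    return star(seq(lit("("), balanced, lit(")")))(text, pos)


def matches(text):
    """True if the whole input is a string of balanced parentheses."""
    return balanced(text, 0) == len(text)


print(matches("(()())"), matches("(()"))    # prints True False
print(choice(lit("a"), lit("ab"))("ab", 0))  # prints 1
```

The last line shows the ordering: the first alternative wins even though the
second would have consumed more input. The recursive `balanced` rule is the
part a strictly regular pattern has no way to express.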

Many PEG engines, especially for dynamic languages, permit grammar composition
using first-class variables. That might be a small barrier to people more
familiar with the terseness and conceptual simplicity of regular expressions
as string'ish values. But it's fairly trivial to implement the latter using
PEGs. For example, LPeg provides a small auxiliary module for doing that:
[http://www.inf.puc-rio.br/~roberto/lpeg/re.html](http://www.inf.puc-rio.br/~roberto/lpeg/re.html)

Also, Rosie seems amazing. I've not yet had the opportunity to make use of it,
but I attended a presentation of Rosie by the author at a Lua workshop which
left me very impressed.

------
timeattack
This is a perfect example of an insight that leads down a rabbit hole of
intertwined complexity in our concepts, once you try to understand why a
regexp for validating regexps is actually much simpler than a regexp for
validating email addresses.

~~~
ChrisSD
It depends what you mean by "validating email addresses".

A regex that tests if a string looks like a valid email address is simple.
Forget the ancient RFCs, a real world email address is in the form of
`mailbox@domainname`. Which is not so difficult to test for with a bit of
care.
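
One possible shape-only check, sketched in Python. The exact rule is an
assumption for illustration: no whitespace, exactly one `@`, and at least one
dot in the domain. It deliberately places no restriction on the TLD.

```python
import re

# Shape check only: mailbox@label.label, no whitespace, no second "@".
EMAIL_SHAPE = re.compile(r"[^@\s]+@[^@\s.]+(?:\.[^@\s.]+)+")


def looks_like_email(addr):
    """True if `addr` has the rough shape of mailbox@domainname."""
    return EMAIL_SHAPE.fullmatch(addr) is not None


print(looks_like_email("me@example.tech"))  # prints True
print(looks_like_email("no-at-sign"))       # prints False
```

Requiring a dot rejects dotless domains such as `user@localhost`, which is a
choice rather than a rule.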

However, testing if the email address is a valid mailbox is harder and indeed
impossible using regex alone. The domain name can be validated using standard
domain tools but in practice the mailbox can be anything that the server will
route to a valid mailbox. The only way to validate it is to send an email.

~~~
dillonmckay
Even then, there is no way to determine if there is a ‘catch-all’ address
configured, or some email tarpit.

I have a domain w/ a .tech TLD, and quite a few frontend JS validators do not
accept my email as a valid address (government sites, some banks).

------
proverbialbunny
Maybe I'm taking this a step too far but doesn't Gödel's Incompleteness
Theorems
([https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_...](https://en.wikipedia.org/wiki/G%C3%B6del%27s_incompleteness_theorems))
state that a language can not define itself? You need a meta language to be
able to define and specify a language in its entirety.

In other words, regex can not parse regex in its entirety. It's impossible.

~~~
wahern
Can a C program parse C source code? A Java program parse Java source code?
Yes, they can, so such a general limitation couldn't be the reason regular
expressions can't parse themselves.

Perhaps you had in mind the Halting Problem:
[https://en.wikipedia.org/wiki/Halting_problem#G%C3%B6del's_i...](https://en.wikipedia.org/wiki/Halting_problem#G%C3%B6del's_incompleteness_theorems)

------
cbarrick
"Regular Expression Matching Can Be Simple And Fast" [1] is worth a read (by
Russ Cox of Bell Labs and Go fame).

[1]:
[https://swtch.com/~rsc/regexp/regexp1.html](https://swtch.com/~rsc/regexp/regexp1.html)

------
seanmcdirmid
A regular language for writing regular language recognizers is an interesting
thought exercise.

------
notyourday
Don't use a regex to do it. Most languages have a way to catch runtime
errors. Stick the regex in a variable, create a dummy input, and run that
block with a catcher around it. If the catcher catches a runtime error, you
have a bad regex. If it does not, you have a valid regex.
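
In Python, for instance, this can be sketched with `re.compile` and
`re.error` from the standard library. The wrapper name is made up, and no
dummy input is needed because compiling alone reports syntax errors.

```python
import re


def is_valid_regex(source):
    """Return True if Python's `re` module can compile `source`."""
    try:
        re.compile(source)
    except re.error:
        return False
    return True


print(is_valid_regex(r"(a|b)*"))  # prints True
print(is_valid_regex("(a"))       # prints False
```

This only answers the question for Python's own dialect; a pattern that is
valid here may be rejected by PCRE or another engine, and vice versa.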

------
a3n
Another way to phrase the question: Are regexes self hosting?

------
Piskvorrr
TL;DR: No. But there exist languages built _on top of_ regular expressions
(notably PCRE) that can. They can't validate _themselves_ , though - turtles
all the way down to Gödel.

~~~
tom_mellior
Plenty of parser formalisms can define their own syntax in their own syntax.
As mentioned elsewhere, C compilers can parse their own source code. Gödel has
nothing to do with it.

------
anewguy9000
it would become conscious

------
ChrisSD
Why would you want a regular expression to detect a valid regular expression?

~~~
afiori
Probably to check if a regular expression is valid

~~~
ChrisSD
Yes, checking to see if a regular expression is valid is useful but that
wasn't what I was asking. Why specifically would you want to use a regular
expression for this job?

~~~
afiori
The simplest explanation is because you already have a lot of functionality
implemented this way.

Still probably the library you are using exposes a validation function...

------
papito
There isn't even a regular expression to detect a valid _email_ , bruh.

------
tyingq
I did enjoy the comment with the Xhibit/Pimp my Ride reference.

