
Ask HN: Why no regex AND? - _acme
Why does the syntax for regular expressions include an "or" (union) operator but not an "and" (intersection) operator?
======
dalke
FWIW,
[http://adsabs.harvard.edu/abs/2008arXiv0802.2869G](http://adsabs.harvard.edu/abs/2008arXiv0802.2869G)
says: "Similarly [to complement], when constructing a regular expression
defining the intersection of a fixed and an arbitrary number of regular
expressions, an exponential and double exponential size increase,
respectively, can in worst-case not be avoided."

I found the 'greenery' project, which says, at
[https://qntm.org/greenery](https://qntm.org/greenery) :

> Elementary automata theory tells us that the intersection of any two regular
> languages is a regular language, but carrying out this operation on actual
> regular expressions to generate a third regular expression afterwards is
> much harder than doing so for the other operations under which the regular
> languages are closed (concatenation, alternation, Kleene star closure).

The developer has code, which lets you do:

    
    
      >>> from greenery.lego import parse
      >>> print(parse("(ab{0,3})*") & parse("(abba)*"))
      (ab{2}a)*
      >>> print(parse("((ss*)t*)") & parse("((ss*)+(tt*))"))
      s+t+
    

I have no other experience with the code, but it was nice to know that such a
package exists.

The author also wrote that it was "the most algorithmically complex thing I've
ever implemented."
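
For reference, the closure property in the quote above is usually shown with a product construction on DFAs rather than on the expressions themselves. Here is a minimal sketch of that idea in Python. The dict-based DFA layout and the two example automata (`even_a`, `ends_b`) are my own assumptions for illustration and have nothing to do with how greenery represents things internally.

```python
from itertools import product

def intersect(d1, d2):
    """Product construction: run both DFAs in lockstep on paired states."""
    alphabet = d1["alphabet"] & d2["alphabet"]
    trans = {}
    for p, q in product(d1["states"], d2["states"]):
        for c in alphabet:
            # A pair only has a move if both component DFAs have one.
            if (p, c) in d1["trans"] and (q, c) in d2["trans"]:
                trans[((p, q), c)] = (d1["trans"][(p, c)], d2["trans"][(q, c)])
    return {
        "states": set(product(d1["states"], d2["states"])),
        "alphabet": alphabet,
        "start": (d1["start"], d2["start"]),
        # Accept only when BOTH components are in an accepting state.
        "accept": set(product(d1["accept"], d2["accept"])),
        "trans": trans,
    }

def accepts(dfa, s):
    """Return True if the DFA ends in an accepting state after reading s."""
    state = dfa["start"]
    for c in s:
        if (state, c) not in dfa["trans"]:
            return False
        state = dfa["trans"][(state, c)]
    return state in dfa["accept"]

# Hypothetical example 1: strings over {a, b} with an even number of 'a'.
even_a = {
    "states": {0, 1}, "alphabet": {"a", "b"}, "start": 0, "accept": {0},
    "trans": {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1},
}
# Hypothetical example 2: strings over {a, b} that end in 'b'.
ends_b = {
    "states": {"x", "y"}, "alphabet": {"a", "b"}, "start": "x", "accept": {"y"},
    "trans": {("x", "a"): "x", ("x", "b"): "y", ("y", "a"): "x", ("y", "b"): "y"},
}

both = intersect(even_a, ends_b)
print(accepts(both, "aab"), accepts(both, "ab"), accepts(both, "aa"))
```

The product automaton is the easy half. Turning it back into a regular expression is the step where the size can blow up.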

------
dice
A regex is one big implicit AND already. The OR is an exception to the normal
rule.

/(abc|def)(123|456)/

You can read that as "(abc OR def) AND (123 OR 456)". The string "abc789"
wouldn't match, for instance.

~~~
chc
If I understand what OP means by "and", it doesn't mean "(abc OR def) AND (123
OR 456)" — it means "(abc OR def) FOLLOWED BY (123 OR 456)". Let's look at
another example, with a hypothetical & operator:

    
    
       /(\D\S)+/
       /(\D|\S)+/
       /(\D&\S)+/
    

If you look at the string "b5 ", the first regex matches "b5"; the second
matches the whole string, because every character is either a non-digit or a
non-whitespace character; and the third matches only "b", because that is the
only character that is both a non-digit and a non-whitespace character.
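
Since no common engine has a real & operator, here is a rough way to check those three results in Python. The third pattern is only an emulation of the hypothetical `(\D&\S)+`, under the assumption that a lookahead on a single character is an acceptable stand-in for intersecting two character classes.

```python
import re

s = "b5 "

# Concatenation: a non-digit FOLLOWED BY a non-whitespace character, repeated.
followed_by = re.match(r"(\D\S)+", s).group()

# Union: each character is a non-digit OR a non-whitespace character.
either = re.match(r"(\D|\S)+", s).group()

# Intersection, emulated: the lookahead (?=\D) tests the next character
# without consuming it, then \S has to consume that same character.
both = re.match(r"(?:(?=\D)\S)+", s).group()

print(repr(followed_by), repr(either), repr(both))
```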

~~~
rnovak
\D is a subset of \S, so the \S accomplishes nothing (said another way, there
is no character that matches \D that doesn't also match \S).

Secondly, there are very few intersecting character classes (sets) that I'm
aware of, and in all cases, you could achieve the desired result more clearly
in other ways.

Said another way: "AND" would just make regexes even harder to
understand/approach, and that is almost always undesirable.

~~~
chc
I was just trying to illustrate how the logic of AND differs from what was
shown above, not give a useful example.

A more practical example might be something like

    
    
        /(10|22)(.*crab.*&.*apple.*)90/ 
    

in order to only match strings where the content between the numeric codes
contains both "crab" and "apple", in either order.

To be clear, I don't know that it's useful enough to warrant inclusion in a
regex engine. I'm just trying to provide a useful illustration.
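
Without such an operator, one way to get the same effect is to do it in two steps: capture the middle, then require it to match both sub-patterns. A minimal sketch in Python, assuming the pattern has to match the whole string (the function name `matches_with_and` is made up for this example):

```python
import re

def matches_with_and(s):
    """Emulate /(10|22)(.*crab.*&.*apple.*)90/ against the whole string."""
    # Step 1: match the outer shape and capture everything between the codes.
    outer = re.fullmatch(r"(10|22)(.*)90", s)
    if outer is None:
        return False
    middle = outer.group(2)
    # Step 2: the SAME captured span must match both sub-patterns.
    return all(re.fullmatch(p, middle) for p in (r".*crab.*", r".*apple.*"))

print(matches_with_and("10 apple crab 90"))
print(matches_with_and("10crab90"))
```

The point of the real operator would be that both sub-patterns constrain the same span of text, which is what step 2 enforces here by hand.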

------
tantalor
Use lookahead assertions for logical AND:

    
    
      (?=...)(?=...)
    

[http://stackoverflow.com/a/24102539](http://stackoverflow.com/a/24102539)
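
A minimal sketch in Python's `re` module. The digit-and-lowercase requirement is a made-up example, not taken from the linked answer. Each `(?=...)` checks a condition from the same starting position without consuming anything, so all of them have to hold.

```python
import re

# Hypothetical requirement: at least one digit AND at least one lowercase letter.
# Both lookaheads start at position 0; the trailing .* consumes the string.
pattern = re.compile(r"^(?=.*\d)(?=.*[a-z]).*$")

def has_digit_and_lowercase(s):
    """Return True only if both lookahead conditions hold for s."""
    return pattern.match(s) is not None

print(has_digit_and_lowercase("abc123"))
print(has_digit_and_lowercase("abc"))
```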

~~~
_acme
This syntax is not supported by the tools I use. If we're going to add to the
language, why not just add a true intersection operator? My question was meant
to be more theoretical - what was the rationale behind incorporating union
into the basic syntax, but not intersection, when regular expressions were
first adapted to text processing?

~~~
_acme
I do have awk available, which allows one to combine regular expressions using
logical operators, but I'm interested in the historical and theoretical
aspects of my inquiry, if any.

------
_RPM
AND is explicit in the groupings.

~~~
_acme
Could you please clarify what you mean by this? Are you referring to
concatenation as a form of logical AND?

