
Perl Incompatible Regular Expressions - eatitraw
https://github.com/yandex/pire
======
brudgers
If you're interested in regular expressions and their place in automata, Jeff
Ullman's _Automata_ course starts today on Coursera:
[https://www.coursera.org/course/automata](https://www.coursera.org/course/automata)

The recent HN discussion of its announcement is here:
[https://news.ycombinator.com/item?id=10089092](https://news.ycombinator.com/item?id=10089092)

Ullman is also coauthor of "The Dragon Book".

~~~
rdc12
Thanks for the reminder, thought it was still a few weeks away.

------
nine_k
Google has a similar library with similar goals; see
[https://github.com/google/re2/wiki/CplusplusAPI](https://github.com/google/re2/wiki/CplusplusAPI).
It also removes backtracking.

The idea is that backtracking may kill performance, so a specially crafted
text that causes a lot of backtracking can be used as a DoS attack.
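
To make the cost concrete, here is a minimal sketch (not PIRE or RE2 code). It counts the calls a naive backtracking matcher makes for the pattern `^(a|aa)*$` and compares that with a single left-to-right pass over the same language. The function names are hypothetical, and the matcher is hard-coded to this one pattern for illustration.

```python
def backtrack_steps(text: str) -> int:
    """Count recursive calls of a naive backtracking matcher for ^(a|aa)*$."""
    steps = 0

    def match(i: int) -> bool:
        nonlocal steps
        steps += 1
        if i == len(text):
            return True
        # Alternative 1: consume a single "a", then try the rest.
        if text.startswith("a", i) and match(i + 1):
            return True
        # Alternative 2: back up and consume "aa" instead.
        if text.startswith("aa", i) and match(i + 2):
            return True
        return False

    match(0)
    return steps


def single_pass_steps(text: str) -> int:
    """Same language (zero or more 'a's), reading each character at most once."""
    steps = 0
    for ch in text:
        steps += 1
        if ch != "a":
            break
    return steps
```

On ten `a`s followed by a `b`, the backtracking version makes 232 calls while the single pass reads 11 characters. The backtracking count grows like the Fibonacci numbers as the input gets longer, which is the behaviour an attacker would exploit.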

------
bane
Wow, really impressive. Sometimes specializing by cutting out functionality is
the right approach. In this case eliminating greedy/non-greedy matching (and
others) means this can work as a high-level triage and something with more
specificity can do the precision work once you have a candidate match.

It looks like this could have a good place in a real-time streaming
architecture somewhere.
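
A rough sketch of that two-stage idea, using Python's `re` for both stages as a stand-in (PIRE itself is a C++ library, and this is not its API). The patterns and the `extract_links` helper are made up for illustration: a cheap capture-free pattern triages lines, and a second pattern with non-greedy capture groups runs only on the candidates.

```python
import re

# Stage 1: cheap yes/no triage, no captures and no laziness needed.
TRIAGE = re.compile(r"<a\s.*href=")
# Stage 2: precise extraction with non-greedy matching and a capture group.
PRECISE = re.compile(r'<a\s[^>]*?href="(.*?)"')


def extract_links(lines):
    """Return href values, running the precise pattern only on triaged lines."""
    links = []
    for line in lines:
        if not TRIAGE.search(line):
            continue  # most lines are rejected here
        links.extend(PRECISE.findall(line))
    return links
```

In a real pipeline the first stage would be the fast automaton-based engine and the second a full-featured one.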

~~~
noobermin
The README does say it grew out of Yandex's web crawler.

------
jhallenworld
README.ru has the real documentation; Google Translate does a pretty good job
with it. It mentions that the algorithms are from the Dragon Book.

I didn't try the code, but I think it's missing full Unicode character class
support (for example when you use \w). But I see it handles Russian :-)

[https://github.com/yandex/pire/blob/master/pire/classes.cpp#...](https://github.com/yandex/pire/blob/master/pire/classes.cpp#L82)

------
js2
See also [https://swtch.com/%7Ersc/regexp/](https://swtch.com/%7Ersc/regexp/)

------
rnovak
What I don't get is that the example given:

    hello\\s+w.+d$

is 100% Perl compatible, so it seems more like "subset" than "incompatible".
I've seen comments that say it's a "joke". Can anyone confirm that the title
was indeed a joke?

Edit: I know what DFAs and NFAs are and how they relate to formal language
theory and regular languages; the question still stands: how can a subset be
called "incompatible"?

~~~
brudgers
Perl's regex engine uses a nondeterministic finite automaton [NFA]. Because
the README indicates each character is examined only once and that PIRE lacks
backtracking, lookahead, and capture groups, its engine is almost certainly
DFA [deterministic finite automaton] based. Thus the syntax it is likely to
choke on or ignore includes anything involving curly braces _{}_.

Keep in mind that "regular expressions" can denote a notation for describing
finite automata and that this is subtly different from the programming
language implementation of regex engines in languages like Perl.

~~~
more_original
Perl's regex engine must use something stronger than plain NFAs. The
expressive power of NFAs and DFAs is exactly the same. They both recognize the
regular languages, which is less than what can be expressed with Perl "regular
expressions".
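
The equivalence is constructive: the standard subset construction turns any NFA into a DFA whose states are sets of NFA states. A minimal sketch, assuming an NFA without epsilon moves encoded as a dict from `(state, symbol)` to a set of next states. That encoding, the function names, and the example automaton (strings over `a`/`b` ending in `ab`) are illustrative choices, not anything from PIRE.

```python
from collections import deque

# Example NFA: accepts strings over {a, b} that end in "ab".
ENDS_IN_AB = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "b"): {2},
}


def nfa_to_dfa(nfa, start, accepting, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    start_set = frozenset([start])
    dfa = {}
    seen = {start_set}
    queue = deque([start_set])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            target = frozenset(
                nxt for state in current for nxt in nfa.get((state, symbol), ())
            )
            dfa[(current, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    # A DFA state accepts if it contains at least one accepting NFA state.
    dfa_accepting = {s for s in seen if s & accepting}
    return dfa, start_set, dfa_accepting


def dfa_accepts(dfa, start, accepting, text):
    """Run the DFA: one table lookup per character (text must use the alphabet)."""
    state = start
    for ch in text:
        state = dfa[(state, ch)]
    return state in accepting
```

For this example the resulting DFA has three states. In the worst case the construction can produce exponentially many states, which is the price paid for reading each input character exactly once.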

~~~
brudgers
I'm using Friedl's classification scheme for regex engines from _Mastering
Regular Expressions_. I don't know of a more standard survey of regexes as
implemented in various programming languages. Anyway, from a practical
standpoint an NFA has to be implemented as a pushdown automaton on the von
Neumann machines we currently have, to allow backtracking to simulate
simultaneous exploration of the arbitrary number of DFA states that a single
NFA state may represent. That doesn't make Friedl's classification useless.
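
As a side note on "simultaneous exploration": backtracking is one way to simulate an NFA, and another well-known way is to carry the set of currently active states forward one character at a time. A minimal sketch, assuming the NFA has no epsilon moves and is encoded as a dict from `(state, symbol)` to next states. The encoding, names, and example automaton are illustrative assumptions.

```python
# Example NFA: accepts strings over {a, b} whose third-from-last character is "a".
THIRD_FROM_END_IS_A = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {2},
    (1, "b"): {2},
    (2, "a"): {3},
    (2, "b"): {3},
}


def nfa_accepts(nfa, start, accepting, text):
    """Simulate the NFA by tracking every state it could be in, without backtracking."""
    active = {start}
    for ch in text:
        # Advance all active states in lockstep on the same input character.
        active = {nxt for state in active for nxt in nfa.get((state, ch), ())}
        if not active:
            return False  # no live states left, so no match is possible
    return bool(active & accepting)
```

Each input character is read once, and the work per character is bounded by the number of NFA states, so no stack of saved positions is needed.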

------
a8da6b0c91d
What was wrong with the GNU basic regex?

If you're going to write a stripped down string matching syntax more strictly
for "regular" text then why bother mentioning perl?

~~~
lylepstein
Seems like a riff on the famous Perl Compatible Regular Expressions library
([http://www.pcre.org](http://www.pcre.org)), which is used in a bunch of
high-profile things (PHP and the Apache HTTP Server, for starters). Kind of
like "less" vs "more".

So a bit of an inside joke I guess, but most people familiar with regexes will
probably have heard of PCRE, so it's not a terribly obscure reference. I liked
it :)

~~~
a8da6b0c91d
PCRE is a set of extensions to RE to enable parsing of nonregular grammars.
This is just RE. It's like calling some hypothetical language C++--. It
doesn't make sense.

~~~
kbenson
Well, if I designed a language that was C++ but stripped of some specific
features for a purpose that made it less than C++, but still not quite just C,
I might call it C++-- as a joke. In that respect, I think it makes perfect
sense.

------
nn3
Scary to think that a major search engine really uses regular expressions
heavily. Regexes are great for quick scripts, but one would expect major
production applications to use better, higher-level parsing algorithms. It
must be a nightmare to debug if you have a lot of regexes interacting in a
large code base.

~~~
loup-vaillant
If the language you're trying to parse happens to be regular, then regular
expressions are the _perfect_ tool for the job. They're simple, reliable, and
impossibly fast.

When you think about it, a search engine is mainly about finding key words in
a huge pile of text, without caring much about the structure of the language
the text is written in. That use case is totally specifiable by a regular
language.
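
A minimal sketch of that use case: a finite set of keywords is a regular language (the alternation of the keywords), so one compiled pattern can scan the text. This uses Python's `re`, which is a backtracking engine, purely as a stand-in. `find_keywords` is a hypothetical helper and assumes a non-empty keyword list.

```python
import re


def find_keywords(text, keywords):
    """Return (offset, keyword) pairs for every keyword occurrence in text."""
    # Longest keywords first, so a keyword is not shadowed by one of its prefixes.
    ordered = sorted(keywords, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in ordered), re.IGNORECASE)
    return [(m.start(), m.group(0).lower()) for m in pattern.finditer(text)]
```

For example, `find_keywords("Pire and RE2 avoid backtracking", ["pire", "re2"])` returns `[(0, "pire"), (9, "re2")]`.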

