
Exploring the Linguistics Behind Regular Expressions - rbanffy
https://dev.to/alainakafkes/exploring-the-linguistics-behind-regular-expressions-bb4
======
zokier
What I find fascinating in the history of regexps is that their syntax has
remained remarkably similar over the years and across very different
implementations and uses. Considering how much variance we have in programming
languages in general, and how many people consider regexp syntax to be
unfriendly, I'm not sure why there hasn't been more experimentation (and
serious alternatives!) in this area.

~~~
jwilk
Perl 6 breaks with the traditional regexp syntax:

[https://docs.perl6.org/language/regexes](https://docs.perl6.org/language/regexes)

~~~
zokier
It also breaks with traditional regexps by not actually being regular
expressions :)

~~~
brianon99
I think what most people mean when they say 'regex' is actually two things:

1. The syntax that PCRE-like regex engines accept.

2. Regular languages, a kind of formal language.

Many regex engines nowadays, like the ones in Perl 5 and Oniguruma, already
break 2 but keep 1 compatible. I think what Perl 6 does is also break 1. (I
am not experienced in Perl 6. Please correct me if I am wrong.) I don't think
it is a problem, though.
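
To make point 2 concrete: backreferences are one of the features that break
regularity. A minimal sketch in Python, whose re module accepts PCRE-like
syntax; the pattern below recognizes the language of a word repeated twice
("ww"), which is provably non-regular, so no finite automaton (and hence no
formally regular expression) can match it:

```python
import re

# (\w+)\1 matches strings of the form ww: a word followed by an exact
# copy of itself. That language is non-regular, so an engine that
# supports \1 is no longer "regular" in the formal-language sense.
pattern = re.compile(r'^(\w+)\1$')

print(bool(pattern.match('abab')))    # True:  'ab' + 'ab'
print(bool(pattern.match('abcabc')))  # True:  'abc' + 'abc'
print(bool(pattern.match('abba')))    # False: not of the form ww
```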

~~~
b2gills
In Perl 6, regexes are a type of method, and you can use them in grammars,
which are a type of class. (You can use them on their own as well.)

Which means you can subclass grammars, compose in regexes with roles, and have
parameterized regexes.

The syntax has also had an overhaul to make it more consistent with itself as
well as the rest of Perl 6. Since you can embed Perl 6 code, some features of
other regular expression engines haven't been implemented as they aren't
needed.

The result of using a regex or grammar is also now a parse tree rather than
True/False or the matched substring.

I generally recommend reading the code for JSON::Tiny::Grammar as a quick
example of what it is like.
[https://github.com/moritz/json/blob/master/lib/JSON/Tiny/Gra...](https://github.com/moritz/json/blob/master/lib/JSON/Tiny/Grammar.pm)

------
posterboy
I expected the article to dive a bit deeper into natural language
understanding, identifying regexes in natural language and how those
constructs are used to build grammars higher in the hierarchy.

I should really get around to reading more on this, but it quickly explodes in
complexity. Suddenly I'm on Wikipedia reading about the analytical hierarchy of
mathematics. All the while, hardly anyone seems to expect that English should
adhere to formal grammars.

It would be really interesting instead to use only the concepts that have been
introduced. So, what's the bare minimum to be expected from a speaker prior to
such a text? Purely appending composition, I guess; enumerations; starting
with 'yes' and... nothing. Typically the first word in a conversation is a
greeting. Hello world! On the other hand, there's an obvious relation between
'I' and '1'.

It's interesting that a lot of language can be built up so that a sentence can
be understood at each stage of its build (with a bit of abuse of the language).
Words are then learned simply by association, by being close to other known
words, so "hello" is implicitly expanded to a whole context. Which in turn is
learned from gestures. And continuous repetition is very important. Feedback:
i.e., success is learned from quieting a crying baby down. Words are learned by
echoing back words that are heard repeatedly (rather, phonemes, so I would
start with "hi", not "hello"). And a lot of repetition is of one's own sounds,
to learn to produce sounds at all and to not forget them again.

And later, whole ideas have to be repeated again and again and refined...
which I guess is why I am writing all this.

Although, "simulation" allows us to do all this quietly and heuristics and
proofs can significantly simplify the process. I guess that can be linked to
context free and higher grammars.

And because of the repition I appreciate this post.

------
steffann
I get the suspicion that the writer doesn't understand the ?, * and +
operators...

~~~
yorwba
There is a problem here in that different regex libraries have different
semantics for these.

I checked the manual for PCRE (_man pcrepattern_), and it says that ? both has
the meaning of {0,1} (zero or one repetition) and turns * and + into
non-greedy variants when it directly follows them.

Similarly, + usually has the meaning of {1,} (at least once) but can also
quantify * and + to prevent backtracking (PCRE calls these possessive
quantifiers).

For an engine whose semantics differ from PCRE, non-greedy matching or
backtracking might not even make sense, if the matching is implemented
differently (e.g. using finite automata that don't backtrack).
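
A concrete illustration of these quantifier semantics, sketched with Python's
re module (which follows PCRE-like syntax; the possessive form needs Python
3.11 or later, where it was added):

```python
import re

text = '<a><b>'

# Greedy: * grabs as much as possible, so .* runs to the last '>'.
print(re.match(r'<.*>', text).group())   # '<a><b>'

# Non-greedy: a following ? makes .* stop at the first '>'.
print(re.match(r'<.*?>', text).group())  # '<a>'

# On its own, ? just means {0,1}:
print(re.fullmatch(r'ab?c', 'ac') is not None)   # True
print(re.fullmatch(r'ab?c', 'abc') is not None)  # True

# Possessive: ++ matches like + but never backtracks (Python 3.11+).
print(re.fullmatch(r'a+ab', 'aaab').group())  # 'aaab' (a+ gives one 'a' back)
print(re.fullmatch(r'a++ab', 'aaab'))         # None  (a++ keeps every 'a')
```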

------
rocqua
I expected more focus on the creeping capabilities of regexes. Especially how
this relates to the tendency towards Turing completeness.

I've heard it said that every language is 'doomed' to creep towards being
Turing complete. This is 'doom' because Turing completeness entails suffering
from the halting problem.

~~~
crdoconnor
>I've heard it said that every language is 'doomed' to creep towards being
turing complete.

Usually that happens for DSLs for tools which probably should have just been
ordinary libraries in existing languages (e.g. ant).

It's often tricky knowing where architecturally to draw the line between
turing completeness and non-turing completeness and the technology landscape
is littered with examples of tools which put it in the wrong place and later
tried to hack around it.

Turing completeness where it isn't necessary IMHO isn't really a problem
because of the halting problem per se - it's a problem because turing complete
code has a higher maintenance cost at the best of times and attracts a ton of
technical debt at the worst.

Old school frameworkless PHP was the clearest example of this IMHO - the lack
of a clear separation between business (should be TC) logic and presentation
logic (should be low powered templating language) caused messes all over the
place.

~~~
eesmith
It's also tricky because it's so easy to make a Turing complete system by
accident. You may not even realize that you've crossed that line.

~~~
crdoconnor
It really _shouldn't_ be hard for the designer to tell. If you're considering
implementing loops, conditionals, or variables in your DSL, then you should
kind of realize what direction you're headed in.

The hard part is realizing from the get-go (before backwards-compatibility
concerns kick in) that your problem space is not conducive to
non-Turing-complete languages in the first place, and that instead of
inventing an exciting new DSL, maybe you should just write a library.

~~~
eesmith
When CSS+HTML5 became Turing complete, do you think the designers knew it?

Or the designers of page-fault handling in x86?

As I understand it, C++ templates were not supposed to be Turing complete, but
they are.

These examples and more come from
[http://beza1e1.tuxen.de/articles/accidentally_turing_complet...](http://beza1e1.tuxen.de/articles/accidentally_turing_complete.html).

------
ggm
Regular expressions existed before UNIX, but G/RE/P made regular expressions
both expressive and commonplace. POSIX carried the job forward, and Perl had a
role to play too.

One family of expressions across grep, sed, awk, ed, ex, and vi. That's awesome.

What I find strange is how late the Emacs family of editors came to a sensible
mechanism for using them. Global search and replace in Emacs has always felt
significantly more 'clumsy' than in the ed/ex/vi family.

Maybe it's me.

------
carapace
One of the strange and wonderful things in the history of the world is that
Chomsky's Transformational Grammar forms the basis of both computer languages
_and_ Neuro-Linguistic Programming.

Part of the origin story of NLP is that they used Transformational Grammar to
analyze therapeutic exchanges between therapists and clients. The "Meta Model"
explicitly uses grammatical structure to detect missing or elided information.

