Every Finite Automaton Has a Corresponding Regular Expression (semantic-domain.blogspot.com)
94 points by lelf on Nov 21, 2019 | 44 comments



The first chapter of Berstel and Reutenauer's "Noncommutative Rational Series with Applications" presents Schützenberger's theorem that every noncommutative rational power series is recognizable (representable by matrices over a semiring), and conversely. The idea is NOT painfully abstract, yet it makes twenty minutes' work of a semester of undergraduate automata theory (an assertion I've tested multiple times in my math office hours).

They work with coefficients in an arbitrary semiring, which can be like watching paint dry. However, the booleans (true/false) and probabilities make two great examples. The realization that hidden Markov chains are exactly the same theory as CS automata theory is mind-blowing.
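
Concretely, the "recognizable" side is a linear representation: each letter gets a matrix over the semiring, and the series is read off with an initial row vector and a final column vector (standard textbook notation, not taken from the linked post):

    S(w_1 w_2 \cdots w_n) \;=\; \lambda \,\mu(w_1)\,\mu(w_2)\cdots\mu(w_n)\,\gamma

With the boolean semiring this is literally a nondeterministic finite automaton; with probabilities it is a hidden Markov chain.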

Once one absorbs this, there's an interesting wrinkle, worked out in the 1960s, that is still poorly understood. If the outputs are probabilities, do the internal coefficients of the matrices need to be probabilities? Why yes, if we are so enamored with fulfilling our expectations that we simply draw graphs and label them in understandable ways for people who aren't willing to work very hard or defy expectations. Why no, if one chases the actual math. There are examples that require arbitrarily many matrix dimensions to model using probabilities, but that can be modeled in few dimensions using real numbers outside the [0,1] range.

This has a simple explanation. There's a pattern one sees twice: a linear combination with coefficients in [0,1] that sum to one. In one guise it's a probability distribution; in the other, barycentric coordinates. One can have a point cloud of small dimension whose convex hull requires arbitrarily many vertices. Restricting the coefficients to the [0,1] range means working with these vertices as separate matrix dimensions; leaving the coefficients unrestricted means working with the raw dimension.
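
A toy illustration of that last point (my own example, not the book's): take a regular n-gon in the plane with vertices v_1, ..., v_n. Every point of the polygon is a convex combination

    p \;=\; \sum_{i=1}^{n} c_i v_i, \qquad c_i \in [0,1], \quad \sum_i c_i = 1,

but no generating set smaller than all n vertices works, since each vertex is extreme. So barycentric coordinates need n entries, while the raw coordinates need only 2, and n is unbounded.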


This perspective is excellent. Besides the book reference, any papers in this area that are relevant?


Not a direct answer to your question, but here are Berstel's publications:

http://www-igm.univ-mlv.fr/~berstel/PubsJeanBerstel.html

You can try to contact him, he might be able to help you.


I worked on a tool that infers a regex from sample strings, by building and then reducing an NDFA.

The NDFA->regex construction worked by "collapsing" a node in the NDFA. Choose a victim node. Construct a Kleene star over all of its self-loops. Then form all (incoming, self-loop*, outgoing) triples of edges; their concatenations become the labels of new edges that bypass the victim. Now you can discard the victim node. Repeat until only the start and goal nodes remain, and you're done (parallel edges just become alternations).
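
Roughly, in Python (hypothetical representation and names, not the tool's actual code; edges are a dict mapping (source, target) to a regex string, and parenthesisation is glossed over):

    def eliminate(edges, victim):
        """Collapse `victim`, rewriting paths through it as regex labels."""
        loop = edges.pop((victim, victim), None)
        star = "(" + loop + ")*" if loop else ""   # Kleene star over the self-loop
        incoming = [(s, r) for (s, d), r in edges.items() if d == victim]
        outgoing = [(d, r) for (s, d), r in edges.items() if s == victim]
        edges = {k: v for k, v in edges.items() if victim not in k}
        # Each (incoming, self-loop*, outgoing) triple becomes a new edge label.
        for s, rin in incoming:
            for d, rout in outgoing:
                piece = rin + star + rout
                edges[(s, d)] = edges[(s, d)] + "|" + piece if (s, d) in edges else piece
        return edges

Calling this on every node other than start and goal leaves a single edge whose label is the regex.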

Surprisingly, the length of the resulting regex would vary dramatically with the order in which nodes were collapsed. Finding the minimum regex was an unexpected challenge. Has anyone explored regexp minimization?

Incidentally it's not the case that Thompson and subset construction "are the basis of every lexer and regexp engine there is." It's not even true of the regexp engine inside the browser you're using now!


If someone is looking, https://github.com/noprompt/frak does the same thing.

>frak transforms collections of strings into regular expressions for matching those strings. The primary goal of this library is to generate regular expressions from a known set of inputs which avoid backtracking as much as possible. It is available as a command line utility and for the browser as a JavaScript library.


This sounds exactly like what I was hoping to learn more about.

I wanted to create an IRC client where I could add/remove modules at runtime. These modules might respond to things said in chat. For example, a greeter module might respond to common greetings like "hi" and "hello". Each module would register expressions for the things it triggers on, so at any moment there would be a pile of registered expressions. Instead of testing all these expressions in series, I wanted to feed them to some code that would decompose and reduce them into a single simplified expression, roughly as if you had arranged the regexps into a nested series of if-conditions.
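
A minimal sketch of the idea, in Python with the stock re module (the module names are made up, and the "reduce" step here is nothing smarter than a named-group alternation):

    import re

    registered = {                      # hypothetical module -> trigger pattern
        "greeter": r"\b(?:hi|hello)\b",
        "weather": r"\bweather in \w+",
    }

    # One combined pattern: a single scan tells us which module should fire.
    combined = re.compile("|".join(
        f"(?P<{name}>{pat})" for name, pat in registered.items()
    ))

    m = combined.search("hello everyone")
    if m:
        print(m.lastgroup)              # -> "greeter"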

Finding an appropriate middle ground for testing expensive regular expressions was the objective of all that, but I didn't get very far as I lost myself in Lua's LPeg project.

What a beautiful goddamn project. So unique and amazing.


> Has anyone explored regexp minimization?

There has been significant work on https://en.wikipedia.org/wiki/DFA_minimization
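
(Minimizing the DFA doesn't directly minimize the regex, but it's the part of the problem that is well understood.) A rough Moore-style partition-refinement sketch in Python, just to show the flavour; Hopcroft's algorithm is the efficient version:

    def minimize(states, alphabet, delta, accepting):
        """delta: dict (state, symbol) -> state. Returns the classes of
        indistinguishable states (merge each class into one state)."""
        partition = [set(accepting), set(states) - set(accepting)]
        partition = [block for block in partition if block]
        changed = True
        while changed:
            changed = False
            refined = []
            for block in partition:
                buckets = {}
                for s in block:
                    # Signature: which block each symbol sends s to.
                    key = tuple(
                        next(i for i, b in enumerate(partition) if delta[(s, a)] in b)
                        for a in alphabet
                    )
                    buckets.setdefault(key, set()).add(s)
                refined.extend(buckets.values())
                if len(buckets) > 1:
                    changed = True
            partition = refined
        return partition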


Very cool, do you know of any tool that does something similar (provide examples, get regex)?


I am not sure this is what you are looking for, but I have been working on Aude (AUtomata DEmystifier), an open source pedagogical application targeted at CS teachers and students that works in browsers without installation (but one can download and run it offline too).

The aim is to visualise and manipulate automata, including conversions between them and regular expressions. Happy to take feedback!

https://aude.imag.fr


That's a neat project! What inspired you to start it?


Thanks for the kind words.

Well, I was a student and needed to practice the related lesson, "Languages and Automata", before the final exam. So I implemented the algorithms from the lesson and used Graphviz to render the results. The thing worked in a browser but ran on the server (written in D!). I figured my fellow students or the teacher might find it useful, so I sent the link to the teacher (in a postscript of a very long mail in which I was asking for help, ha ha).

"Your tool seems really interesting, what about you continue developing this during an internship this summer?"

Hell, yes. And Aude started.

Since then, I've been lucky to mentor several interns working on Aude with this teacher.


That is very cool and I'll bet having a steady flow of later cohorts of students helped a ton!


I would not say a steady flow: it was two to three students for two months per summer over the last three years, but they helped a lot anyway :-)

And it was fun to work with them. The first "cohort" of three students called themselves the Aude Team and they put that in their internship report.



Microsoft PROSE kind of does this. I haven't looked at it for a while so I can't quite remember the details, but it might be of interest to you.


This is an example of an important piece of knowledge that typically only programmers with a formal education in computer science are taught. This theorem is taught in all formal language courses.


Pretty much anybody who has read a book on automata (even if long ago) will remember that each item in this list can express a superset of what the previous can:

- finite state machines

- regular expressions[1]

- context free grammars

- recursive languages

- Turing-complete languages

- an even larger set of problems which cannot be decided by any program (google the Cantor diagonalization argument, Turing machines)

Many people, myself included, have read such books outside the context of formal schooling. It's likely an order of magnitude more have gone through an automata MOOC. Other than Data Structures & Algorithms, it's probably the formal CS schooling material that comes up most broadly in professional programming careers.

[1] Caveat: Regular expressions in the formal sense are equal in power to FSMs, but what we call "regex" has actually been extended past the point of technically being regular, with PCRE features such as backreferences and lookaround. Thank you, Perl and friends!
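
For a concrete example of the caveat (a standard textbook fact, shown here with Python's re module): a single backreference already matches a language no finite automaton can recognize, the "squares" ww:

    import re

    square = re.compile(r"^(.+)\1$")     # matches exactly the strings of the form ww

    print(bool(square.match("abcabc")))  # True  ("abc" + "abc")
    print(bool(square.match("abcabd")))  # False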


This is misleading. Regular expressions are equally as powerful as finite state machines. So it would go:

- finite state machines / regular expressions / regular languages

- context-free grammars / pushdown automata / context-free languages

- recursive languages (in general, languages programmers use fall into this category)

- Turing machines / recursively enumerable languages (some programming languages fall into this category, but it’s usually by accident)

- non-computable, e.g. busy beaver

Playing fast and loose with language / program because of the equivalences. Also note “regular expressions” above doesn’t include PCRE or things like backreferences.


Btw, to add to that: a pushdown automaton with two separate stacks is as powerful as a Turing machine. And you can interpret a standard finite state machine as a pushdown automaton without any stack. So the number of stacks is what adds the power (although more than two stacks does not make it any more powerful).
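
The construction behind that is pleasantly simple: split the Turing machine's tape at the head into two stacks. A rough Python sketch (my own naming):

    class TwoStackTape:
        """Cells left of the head live on `left` (top = nearest);
        the head cell and everything to its right live on `right`."""
        def __init__(self, contents, blank="_"):
            self.left = []
            self.right = list(reversed(contents)) or [blank]
            self.blank = blank

        def read(self):
            return self.right[-1]

        def write(self, symbol):
            self.right[-1] = symbol

        def move_right(self):
            self.left.append(self.right.pop())
            if not self.right:
                self.right.append(self.blank)

        def move_left(self):
            self.right.append(self.left.pop() if self.left else self.blank)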


Thanks, updated and added a note for regexp so as not to mislead!


In that case it’s absolutely incorrect, not merely misleading. It definitely should not be between FSM and CFG. It should either be equal to FSM (regular expressions) or it should be much more powerful (if you have backreferences, like PCRE does, you can encode SAT problems in your regexes, and this cannot be done with CFG, but recursive languages are strictly more powerful).


> It should either be equal to FSM...

superset != strict superset


That's why I originally said “misleading”, if we’re going to be pedantic.


> This is an example of an important piece of knowledge that typically only programmers with a formal education in computer science are taught.

This is the 21st century. Why aren't we teaching stuff like this in middle school? It wouldn't require an entire course: just cover FSMs and stack machines, then let people know that there's stuff beyond that. This would give people the basic "mental furnishings" to deal with notions of computational complexity beyond the level of total fuzzy "woo." In 2019, ordinary people need to know something about this stuff to make decisions at work and to vote!

Yes, just the most basic, minimally concrete idea of how this stuff works would be valuable for society. When Ford automobiles were new, some new car owners would try to "fix" their cars by hanging bulbs of garlic under the hood. Today, we find this quaint, magical thinking, because everyone has the barest, minimally concrete, idea of how a car engine works. Are you annoyed at the general ignorance of politicians around technical issues? Well, it's just analogous to politicians with the above level of automotive ignorance passing motor vehicle laws.


A couple of reasons spring to mind. One is that teachers don’t know the stuff. I lived through the “new math” revolution of the 60s and 70s. The folks who came up with it really had a good idea: teach kids mathematics rather than memorized times tables and blind syntax manipulation. They had training for the teachers... who were taught new things to do in the classroom... which ended up simply as a different vocabulary for syntax manipulation. The same thing happened (and not just in schools) when “using a computer” meant memorizing mouse gestures and moving the mouse to menu items in known locations on the screen.

The other problem is the old “why teach algebra? Nobody will ever see it after they leave school” objection.


I agree in general but I find this piece of knowledge to be too theoretical and specific (the equivalence of regex & finite automata).

I find it much more important to know how networking and the Internet works and basic client-server architecture. And maybe some things about authentication. These topics seem much more tangible and important to non-technical people than CS theory.


Pretty sure this equivalence was more or less the first sentence in the regex tutorial I read during high school...


General technical literacy is important, but I have a hard time connecting the equivalence of finite automata and regular expressions to anything one would vote on (and I say that as a fan of automata!)


That specific tidbit isn't the point. It's people trained to think about algorithmic complexity, having had some minimal experience with factors which can increase it.


I had friends who came into college already knowing this, so this isn't a given.


You may know of a few individuals who already knew this, but that doesn't discount the gp's word "typically". Most people don't know about it, period.

For those who do know about it: unless they have taken an Automata or Theory of Computing course, they have probably still missed the point of this factoid, which is the main reason this theorem is important.

A Finite State Machine is capable of performing many computations, but it is limited in processing capability in that it has no memory beyond its current state. A Regular Expression can express exactly the same computations as the FSM, and is subject to the same constraint and therefore the same limitations. (Of course, it is also important to know that the RegEx talked about in an Automata class is different from the RegEx that most programmers and SysOps deal with in practice.)

The underlying (important) concept is that validating the pattern of characters in a string is simply a proxy for solving a computational problem. In fact, every decision problem can be represented in this paradigm (the validation of a string of input). Determining whether the machine in question can perform that validation is the purpose of the Automata or Theory of Computing course.
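
A tiny illustration of that (my own example, not the gp's): primality, normally an arithmetic question, is just membership in the language of binary strings that encode primes.

    # "Is n prime?" recast as "is w in PRIMES?", where
    # PRIMES = { w in {0,1}* : w encodes a prime in binary }.
    def in_PRIMES(w):
        if not w or any(c not in "01" for c in w):
            return False
        n = int(w, 2)
        return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

    print(in_PRIMES("101"))   # True  (5 is prime)
    print(in_PRIMES("1000"))  # False (8 is not)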

It's not sufficient to say "I know a fact." And simply knowing the fact isn't the point of the class. The reason for the class is to understand why the fact is important to know.


I am specifically asserting that there were people in my college freshman class who had a handle on what the statement meant, its implications, and a whole lot more on formal automata than you've talked about here. Mostly because they had an interest in it before arriving and were exposed to it in some way beforehand, be it through an actual textbook or practical experience. What I'm trying to say is that there is nothing that can only be learned in a class; it just often provides a decent framework for getting a well-rounded exposition to a topic.


I would love to read more about why this is important/interesting/useful

My eyes very quickly glazed over while reading the blog post; it seemed like an obscure theoretical curiosity.


I know this and I don’t have a formal education in computer science.

In general you don’t need to pay someone to tell you to read books. You can just read them.


That is certainly true but these theoretical concepts don't seem to have obvious/tangible benefits at first, so it is very understandable that many who are self-taught wouldn't know much about them.


This is true for me; I'm unlikely to have learnt about this unless it particularly interested me.


Out of interest, what book did you read that explained this?

I know a bit about FSMs, but for regexes I only know that they are based on them. However, my self-learning tends to be more impulsive and reactive.


I think I learned this because of the pumping lemma for regular languages. In the FSM picture, it becomes totally obvious: every sufficiently long path in a finite graph contains a loop.
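
For reference, the standard statement: for every regular language L there is a constant p such that

    \forall w \in L,\ |w| \ge p:\quad w = xyz \ \text{ with } \ |xy| \le p,\ |y| \ge 1,\ \text{ and } \ xy^iz \in L \ \text{ for all } i \ge 0.

The loop in the path is exactly the y you get to repeat.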


I don’t remember, since I’ve known about this construction for years. Could just be some lecture notes I downloaded from somewhere.


The "Dragon" compiler book is pretty comprehensive on regex.


That’s a very cool construction of regular expressions, just using matrices and semirings. I loved the explanation of why F and G were as they were as well. I had two comments:

> Assume we have a regular language (R, 1, ., 0, +, * )

Shouldn’t this data (a semiring with a *-operation, or “closure”) be not a single regular language, but the set of _all_ regular languages? A single regular language should be an element of R.

I also think that the power series A* = 1 + A + A^2 + ... looks more like 1/(1-A) than e^A, which makes sense given the defining equation A* = 1 + A(A*), but that’s not much of an issue since neither of these properly makes sense in this semiring.
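
Spelling the analogy out (my restatement, over the reals with |A| < 1):

    A^* \;=\; \sum_{n \ge 0} A^n \;=\; \frac{1}{1-A}, \qquad A^* = 1 + A\,A^*,
    \qquad \text{vs.} \qquad e^A \;=\; \sum_{n \ge 0} \frac{A^n}{n!},

where e^A carries the 1/n! factors that the Kleene star lacks.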


What is the regex for a given state of Game of Life?


Game of Life isn't a finite automaton. It has an infinite number of states (and is Turing complete, too).


Is that why my neural network can't make any sense of GoL progression? 2^512 possible rules...





