Hacker News
Generate Regex from Examples (units.it)
79 points by atroche on July 25, 2015 | 43 comments



The scholarly articles behind this:

Automatic Generation of Regular Expressions from Examples with Genetic Programming[0]

Automatic String Replace by Examples [1]

[0] http://machinelearning.inginf.units.it/publications/internat...

[1] http://machinelearning.inginf.units.it/publications/internat...


A nice discussion on the hardness of finding regular expressions from examples: http://cstheory.blogoverflow.com/2011/08/on-learning-regular... . One interesting point: A good such system could easily break RSA.


Hi guys, thank you for your interest in our system for the automatic generation of regexes.

We are experiencing some trouble with connectivity due to the... unexpected popularity :)

We are working to solve this, but we are a very small research group with limited resources.

Meanwhile, if you are interested in the details of our algorithm, take a look at the source code hosted here: https://github.com/MaLeLabTs/RegexGenerator


Is it hosted by Università di Trieste?


It is hosted on our machine in our lab. Maybe it's time to move elsewhere... ;)


I think so ;)


LOL :P


Given example strings:

  {a, b, c, d}
The most restrictive regular expression is:

  a | b | c | d
The most permissive regular expression is:

  *
If the system produces anything in between I cannot predict the output without knowing a set of rules that are more complicated than the regular expression that the system of rules produces. TANSTAAFL.
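The two extremes can be made concrete with a minimal Python sketch (the variable names and the `.*` choice for "most permissive" are mine, for illustration):

```python
import re

examples = ["a", "b", "c", "d"]

# Most restrictive: the exact union of the positive examples.
restrictive = "|".join(re.escape(s) for s in examples)  # "a|b|c|d"

# Most permissive: accept any string at all.
permissive = ".*"

assert all(re.fullmatch(restrictive, s) for s in examples)
assert re.fullmatch(restrictive, "e") is None    # rejects anything unseen
assert all(re.fullmatch(permissive, s) for s in examples)
assert re.fullmatch(permissive, "anything else")  # accepts everything
```

Anything the generator returns between these two poles reflects its internal heuristics, which is exactly the point being made above.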


Moreover, an equally restrictive regex could be:

[a-d]

I start to think that we are talking about different topics ;-)


That assumes a particular encoding, e.g. that a,b,c,d are not names of expressions.


The system also requires negative examples.


Negative examples do not change the most restrictive regular language. It remains the union of the accepted strings.

Negative examples constrain the most permissive interpretation to the regular language defined:

   1a. Let NegativeExamples be the most restrictive regular
   expression [as above] using the negative examples as 
   input.

      or

   1b. Let NegativeExamples be the most permissive regular 
   expression [as above] using the negative examples as
   input.

   2. Let U be the regular language defined by the Kleene star.
The negative examples define the regular language:

   U - NegativeExamples
And depending on the choice of 1a or 1b, you get a regular language slightly more restricted than the Kleene star or the empty language.
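In a practical regex dialect, the "Kleene star minus the negative examples" language can be sketched with a negative lookahead (the example negatives are mine, for illustration):

```python
import re

negatives = ["b", "1000", "cd"]

# U - NegativeExamples: any string that is not one of the negatives.
neg_union = "|".join(re.escape(s) for s in negatives)
pattern = rf"(?!(?:{neg_union})\Z).*"

assert re.fullmatch(pattern, "b") is None  # a negative example is rejected
assert re.fullmatch(pattern, "anything")   # everything else is accepted
assert re.fullmatch(pattern, "")           # even the empty string
```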

Regular expressions are a compact way of precisely describing an enumeration of permutations. The only way to get equal precision via enumeration is to fully enumerate all the permutations. Once a parser gets into the picture, then the input strings have to have semantics that the parser can use and a hidden set of parser rules for interpreting those semantics.


I do not understand the point. Really. Maybe it's my limit... or you misunderstood the purpose of the automatic regex generator.


[skipping wikipedia links]

Formally, regular expressions describe finite automata. A finite automaton is a state machine that accepts a language. Hence, a regular expression defines a language [and in particular a regular language]. Regular languages have closure properties, which is to say that under certain operations two regular languages produce another regular language, and therefore certain operations on regular expressions can produce another regular expression.
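Closure under union, for instance, is directly visible at the regex level: alternating two regexes yields a regex for the union of their languages (the patterns below are illustrative):

```python
import re

r1 = r"[0-9]+"   # language of digit strings
r2 = r"[a-z]+"   # language of lowercase strings

# Closure under union, by construction: the alternation of the two regexes.
union = rf"(?:{r1})|(?:{r2})"

assert re.fullmatch(union, "123")
assert re.fullmatch(union, "abc")
assert re.fullmatch(union, "a1") is None  # in neither language
```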

The regular expression generator takes some input and produces a regular expression. There are two possible scenarios:

  1. Treat the input as one or more regular languages
  and perform only closed operations.

  2. Treat the input as some other type of language,
  e.g. a context free language and parse the input
  according to the rules of that language to produce
  a regular expression.
The first option means that we can determine the output of the regular expression generator by knowing only one rule: does it generate regular expressions by adding to the empty language, or by difference from the Kleene star? But there is still ambiguity that requires knowledge of the internal workings of the expression generator.

The second option means that in order to determine what output a given input will produce we have to know about a potentially unlimited number of rules (and these rules may change dynamically based on the input)[1].

What this means is that no regular expression generator is generally useful as a black box. Ambiguity assures it, and ambiguity is intrinsic because the regular expression generator must treat the input as within a more complex language (e.g. context free) and force it into the simpler class of regular languages. As my analysis shows, the way in which the regular expression generator interprets the input is arbitrary.

If you're really interested in the topic, Jeff Ullman (who with Alfred Aho wrote the book on automata) teaches a course on automata via Coursera. Rumor (well, actually an email) has it that he is teaching it again this fall. A person could do worse than learning about automata from a person down a Stanford hallway from Donald Knuth's office.

Even cooler is that Jeff Ullman shows up in class to discuss questions (which is probably why the course only comes around once a year).

https://www.coursera.org/course/automata

Just because lunches cannot be free doesn't mean they can't be cheap.

[1] Moreover, via the Halting Problem we cannot be certain that the regular expression generator will produce any output at all.


I think I understood your point, but I suppose you misunderstood our intent. Daily extraction tasks are commonly solvable with a regular expression; this tool is intended for people handling a task where a regular expression suffices to solve it. An extraction task may or may not be solvable with a regular expression; in the second case, the regex generator is not the right tool for the job.

The provided examples often cover a subset of the target language; there are infinitely many regular languages that fit the task---described by the provided examples---and infinitely many regular expressions generating each of them. The regex generator searches for a regular expression taking other constraints into account---e.g., it prefers small regular expressions. In this way we use regular expression length as a heuristic that pushes towards more generic regular expressions and more human-readable, understandable solutions.
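The length preference can be sketched as a selection among candidate regexes that already fit the examples (the candidates and examples below are mine; the actual system evolves candidates with genetic programming rather than enumerating them):

```python
import re

positives = ["1", "23", "512"]
candidates = [r"1|23|512", r"[0-9]+", r"\d+"]

# Keep only candidates that accept every positive example...
fitting = [c for c in candidates
           if all(re.fullmatch(c, s) for s in positives)]

# ...and prefer the shortest as the most generic, most readable guess.
best = min(fitting, key=len)
assert best == r"\d+"
```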

We know that we cannot infer a regular language from an incomplete subset of it; the regex generator is intended as a practical tool that solves a real-world problem.

Anyhow, the literature is full of papers about inferring an automaton from examples: http://link.springer.com/chapter/10.1007/BFb0054059 http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1299597 http://www.sciencedirect.com/science/article/pii/S0031320305... http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1432740


1. The smallest regular expression that contains the input is either:

  The empty language + the positive examples
or

  Kleene star - the negative examples
arbitrarily based on the number of examples in one set versus the other.

2. The only way for a user to know if the regular expression provided by the output is meaningful is to already know the regular expression that describes the desired state machine...or to knowingly accept that the output regular expression is based on rules that are not derived directly from the input and ideally to understand what those rules are. That's fine for sophisticated users who understand the context, but not for people who don't understand automata.

3. The end user gains nothing by withholding training data because the regular expression is deterministic. Withholding training data is only useful for understanding the heuristics of the generator and tuning it. The regular expression itself is simply right or wrong.

4. While automatic generation of automata sounds like the sort of thing that solves real world problems, the automatic generation of regular expressions of the sort programmers rely on seems more likely to produce bugs of the sort that arise when a programmer writes code that they don't understand.


1. What do you mean by "smallest" (number of states, number of characters, ...)?

2. Evaluating solution quality on the training data is wrong. In order to mitigate the risk of overfitting, the Regex Generator learns the regular expressions from half of the training examples and validates them on the other half. We also assessed our algorithm on 20 extraction tasks and evaluated the final solutions on unknown corpora (testing): the quality of the final solutions is comparable to expert human solutions.
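The 50/50 split described above can be sketched like this (the scoring function and data are stand-ins, not the authors' actual genetic-programming code):

```python
import re
import random

def score(pattern, examples):
    """Fraction of examples the candidate pattern fully matches."""
    return sum(bool(re.fullmatch(pattern, s)) for s in examples) / len(examples)

examples = ["12", "345", "6", "78", "901", "2", "34", "567"]
random.seed(0)
random.shuffle(examples)

half = len(examples) // 2
train, held_out = examples[:half], examples[half:]

# A candidate learned on `train` is scored on the held-out half,
# which mitigates overfitting to the training examples.
candidate = r"\d+"
assert score(candidate, held_out) == 1.0
```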

"Sophisticated" regex users don't need our tool: regex generator is intended for novice users or to demonstrate that we can automatically find solutions which are comparable with human ones.

Please note that defining an extraction task is always error-prone: there may be errors in the task definition (understanding) or during the regex coding; smart and expert programmers make errors too, and there may be corner cases they have not thought about. Sometimes you need to get the job done--with fair confidence--and improve it later. This is our view of the real world.

Most important thing: your criticisms are valid for all problems of supervised machine learning. Do you really think that driverless cars, antispam filters, automatic translators and so on are useless only because they are trying to infer a model from partial data? I do not think so.


If the inputs are

    +ve: a | aaa | aaaaaaa | aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
    -ve: b | 1000 | cd
a+ would by most measures be a "smaller" regular expression than a|aaa|aaaaaaa|aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

No?

EDIT: Formatting.
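A quick check (in Python's re syntax) that a+ fits those inputs and is indeed shorter than the literal alternation:

```python
import re

positives = ["a", "aaa", "aaaaaaa", "a" * 32]
negatives = ["b", "1000", "cd"]

pattern = r"a+"
assert all(re.fullmatch(pattern, s) for s in positives)
assert not any(re.fullmatch(pattern, s) for s in negatives)

# Far shorter than enumerating the positive examples:
assert len(pattern) < len("|".join(positives))
```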


If we imagine aa is an accepting state, then we can imagine that a+ is a good return value. And it's easy. That doesn't mean it's a good idea since accepting aa may be a bug in our larger code.


Well, any heuristic like that is possibly going to give either false positives or false negatives when applied to a set larger than the training data. A pure white-list approach is definitely an option for determining the accepted inputs, but generally some heuristic that attempts to accept inputs "like" the examples will probably be better.


Sounds like the Universal Field Extractor: https://splunkbase.splunk.com/app/494/


Archived site, seems to be down right now: https://archive.is/J8MIR


What regex does it generate for the inputs shown on that page?


I can't test it right now (site down), but I assume that if I give it the examples 1, 2, 23, 512, 461, 781, it will come up with the regular expression ([0-9]+).

However, it could also have come up with the rule ([0-8]*).

So how does it know which one to choose? Can one also submit counter-examples?
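Both hypotheses are indeed consistent with those positive examples, which is exactly why counter-examples matter (a quick check):

```python
import re

examples = ["1", "2", "23", "512", "461", "781"]

# Every example uses only the digits 0-8, so both patterns fit.
for pattern in (r"[0-9]+", r"[0-8]*"):
    assert all(re.fullmatch(pattern, s) for s in examples)

# A single counter-example such as "9" separates the two hypotheses:
assert re.fullmatch(r"[0-9]+", "9")
assert re.fullmatch(r"[0-8]*", "9") is None
```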


This can't really work 100% without exhaustively giving every string accepted by the regex, or giving every string not accepted by the regex. Regular expressions are essentially a shorthand for doing that.

Since it's impossible to do accurately, it has to use some "common sense" via genetic programming or whatever. In your example, it probably sees that people rarely want [1-8]{1,3}, but that it's a subset of \d+ and goes with that instead.


I tried your dataset (exported in JSON): http://pastebin.com/7XU2UfEd

The result is: \d


Some negative examples can help (it is designed to perform text extraction).


The system tries to generalize, so it probably chooses something like \d+ (it is shorter).


Depending on unicode settings, \d != [0-9]. See:

http://stackoverflow.com/a/6479605/4359699
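In Python 3, for instance, \d on a str pattern matches any Unicode decimal digit unless the re.ASCII flag is given:

```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, Unicode category Nd

assert re.fullmatch(r"\d", arabic_three)               # \d is Unicode-aware
assert re.fullmatch(r"[0-9]", arabic_three) is None    # [0-9] is literal
assert re.fullmatch(r"\d", arabic_three, re.ASCII) is None  # opt out
```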


Looks promising if the title matches what it does - but page does not load.


I've been looking for tools that would help me do this. Are there any interesting datasets that you have converted to regexes with this?

Zipcodes would be something I'd like to see available as regexes.


I can't load the site, so I'm not sure exactly what it's offering, but I'm also not sure what you mean by a regex for zipcodes. Surely it'd be simpler, quicker, and more maintainable just to match /(\d{5})/ and filter out invalid codes after the fact?

(Edit: If you want the full ZIP+4 where it's given, you can add (\-\d{4})? as a term in the regex, and you can validate them the same way. USPS also offers an API for validating address information; it's been some years since I used it, and I recall it being somewhat recondite in the fashion of most .gov APIs, but it's there and it works.)
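The pattern sketched above can be checked directly:

```python
import re

# 5-digit ZIP with an optional ZIP+4 suffix.
zip_re = re.compile(r"(\d{5})(\-\d{4})?")

assert zip_re.fullmatch("12345")
assert zip_re.fullmatch("12345-6789")
assert zip_re.fullmatch("1234") is None       # too short
assert zip_re.fullmatch("12345-678") is None  # incomplete +4 part
```

As noted, this only validates the shape; whether a given code actually exists still requires a lookup against real data.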

Edit again: OK, the site loaded. It looks like a tool for generating regexes via machine learning techniques to match highly complex targets. From a purely technical perspective, that sounds really neat -- but from a maintainability perspective, it sounds like even more of a nightmare than complex regexes usually are.


The site has gone down twice, once for a power outrage and another time because of the traffic peak... I think the problems are solved now. :-)


> power outrage

I like that!


Ouch! My fault; it sounds like that, but it is not :-)


The GeoNames Country Information table has a "Postal Code Regex" column that you might find interesting: http://download.geonames.org/export/dump/countryInfo.txt


Some of those regexes are terrifying. Granted that I'm not sure there is any better alternative for extracting structured address information from an unstructured source, but...


Why does the resulting regex contain ++ instead of only one +? Such regexes are wrong!


"Possessive quantifiers are a way to prevent the regex engine from trying all permutations. This is primarily useful for performance reasons. You can also use possessive quantifiers to eliminate certain matches."

http://www.regular-expressions.info/possessive.html
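A small illustration of the difference (assumption: possessive quantifiers require Python 3.11+, hence the version guard; on older versions the ++ syntax raises re.error):

```python
import re
import sys

# Greedy a+ backtracks: after consuming "aaa" it gives one 'a' back
# so that the trailing "ab" can still match.
assert re.fullmatch(r"a+ab", "aaab")

if sys.version_info >= (3, 11):
    # Possessive a++ never gives characters back, so the same input fails...
    assert re.fullmatch(r"a++ab", "aaab") is None
    # ...but where no backtracking is needed, a++ and a+ agree.
    assert re.fullmatch(r"a++b", "aaab")
```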


OK, thanks, now it makes sense.



... 1006 days ago.


... to be fair, we have recently updated the engine and web interface; the new application is a big step forward: the same URL, but a totally different thing. The older tool is here: http://regex.inginf.units.it/extract/



