
Generate Regex from Examples - atroche
http://regex.inginf.units.it/
======
riffraff
The scholarly articles behind this:

Automatic Generation of Regular Expressions from Examples with Genetic
Programming [0]

Automatic String Replace by Examples [1]

[0]
[http://machinelearning.inginf.units.it/publications/internat...](http://machinelearning.inginf.units.it/publications/international-conference-publications/automaticgenerationofregularexpressionsfromexampleswithgeneticprogramming)

[1]
[http://machinelearning.inginf.units.it/publications/internat...](http://machinelearning.inginf.units.it/publications/international-conference-publications/automaticstringreplacebyexamples)

~~~
thomasahle
A nice discussion on the hardness of finding regular expressions from
examples:
[http://cstheory.blogoverflow.com/2011/08/on-learning-regular...](http://cstheory.blogoverflow.com/2011/08/on-learning-regular-languages/).
One interesting point: a sufficiently good system of this kind could easily
break RSA.

------
mimmuz
Hi guys, thank you for your interest in our system for the automatic
generation of regexes.

We are experiencing some connectivity trouble due to the... unexpected
popularity :)

We are working to solve this, but we are a very small research group with
limited resources.

Meanwhile, if you are interested in the details of our algorithm, take a look
at the source code hosted here:
[https://github.com/MaLeLabTs/RegexGenerator](https://github.com/MaLeLabTs/RegexGenerator)

~~~
SnaKeZ
Is it hosted by Università di Trieste?

~~~
mimmuz
It is hosted on our machine in our lab. Maybe it's time to move elsewhere...
;)

~~~
SnaKeZ
I think so ;)

~~~
mimmuz
LOL :P

------
brudgers
Given example strings:

    {a, b, c, d}

The most restrictive regular expression is:

    a | b | c | d

The most permissive regular expression is:

    .*

If the system produces anything in between I cannot predict the output without
knowing a set of rules that are more complicated than the regular expression
that the system of rules produces. TANSTAAFL.
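To make the two extremes concrete, here is a minimal Python sketch. It assumes the permissive pattern is spelled `.*` and treats a pattern as accepting a string only when it matches the whole string; the helper name `accepts` and the list of unseen strings are made up for illustration.

```python
import re

examples = ["a", "b", "c", "d"]

# Most restrictive: accepts exactly the four examples and nothing else.
restrictive = re.compile(r"a|b|c|d")
# Most permissive: accepts any string at all (assumed spelling: ".*").
permissive = re.compile(r".*", re.DOTALL)

def accepts(pattern, text):
    """Return True when the pattern matches the whole string."""
    return pattern.fullmatch(text) is not None

# Both patterns are consistent with every example given.
both_fit = all(accepts(restrictive, s) and accepts(permissive, s) for s in examples)

# They only disagree on strings that were never shown.
unseen = ["e", "ab", ""]
restrictive_on_unseen = [accepts(restrictive, s) for s in unseen]
permissive_on_unseen = [accepts(permissive, s) for s in unseen]
```

The examples alone cannot separate the two patterns; any choice between them comes from a preference that has to be supplied from outside the data.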

~~~
mimmuz
Moreover an equally restrictive regex could be:

[a-d]

I start to think that we are talking about different topics ;-)

~~~
brudgers
That assumes a particular encoding, e.g. that a,b,c,d are not names of
expressions.

------
kordless
Sounds like the Universal Field Extractor:
[https://splunkbase.splunk.com/app/494/](https://splunkbase.splunk.com/app/494/)

------
dewey
Archived site, seems to be down right now:
[https://archive.is/J8MIR](https://archive.is/J8MIR)

~~~
Retr0spectrum
What regex does it generate for the inputs shown on that page?

------
amelius
I can't test it right now (site down), but I assume that if I give it the
examples 1, 2, 23, 512, 461, 781, it will come up with the regular expression
([0-9]+).

However, it could also have come up with the rule ([0-8]*).

So how does it know which one to choose? Can one also submit counter-examples?
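A rough Python sketch of why counter-examples would help here. This is only a consistency filter over two hand-written candidates, not the tool's actual algorithm; the function name `consistent` and the choice of the empty string as a counter-example are assumptions for illustration.

```python
import re

positives = ["1", "2", "23", "512", "461", "781"]
candidates = ["[0-9]+", "[0-8]*"]

def consistent(regex, accept, reject=()):
    """True if regex fully matches every accept string and no reject string."""
    pattern = re.compile(regex)
    return (all(pattern.fullmatch(s) for s in accept)
            and not any(pattern.fullmatch(s) for s in reject))

# With positive examples only, both candidates survive.
without_negatives = [r for r in candidates if consistent(r, positives)]

# One counter-example (the empty string) rules out [0-8]*.
with_negatives = [r for r in candidates if consistent(r, positives, reject=[""])]
```

None of the listed numbers contains a 9 or a 0, so nothing in the positive examples contradicts `[0-8]*`; only extra information, such as a counter-example or a preference for shorter patterns, breaks the tie.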

~~~
mimmuz
The system tries to generalize, so it would probably choose something like
\d+ (it is shorter).

~~~
nothrabannosir
Depending on unicode settings, \d != [0-9]. See:

[http://stackoverflow.com/a/6479605/4359699](http://stackoverflow.com/a/6479605/4359699)
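As a small illustration in Python 3, where `\d` on `str` patterns is Unicode-aware by default and the `re.ASCII` flag restricts it to `[0-9]`. Other engines have different defaults, so treat this as one data point rather than a general rule.

```python
import re

# The digits three and four in Arabic-Indic script (U+0663, U+0664).
arabic_indic = "\u0663\u0664"

# Default str patterns: \d matches any Unicode decimal digit.
unicode_digits = re.fullmatch(r"\d+", arabic_indic) is not None

# With re.ASCII, \d is limited to 0-9.
ascii_digits = re.fullmatch(r"\d+", arabic_indic, re.ASCII) is not None

# An explicit range never matches the non-ASCII digits.
explicit_range = re.fullmatch(r"[0-9]+", arabic_indic) is not None
```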

------
malditojavi
Looks promising if the title matches what it does, but the page does not load.

------
captn3m0
I've been looking for tools that would help me do this. Are there any
interesting datasets that you have converted to regexes with this?

Zipcodes would be something I'd like to see available as regexes.

~~~
aaronem
I can't load the site, so I'm not sure exactly what it's offering, but I'm
also not sure what you mean by a regex for zipcodes. Surely it'd be simpler,
quicker, and more maintainable just to match /(\d{5})/ and filter out invalid
codes after the fact?

(Edit: If you want the full ZIP+4 where it's given, you can add (-\d{4})? as
a term in the regex, and you can validate them the same way. USPS also offers
an API for validating address information; it's been some years since I used
it, and I recall it being somewhat recondite in the fashion of most .gov APIs,
but it's there and it works.)

Edit again: OK, the site loaded. It looks like a tool for generating regexes
via machine learning techniques to match highly complex targets. From a purely
technical perspective, that sounds really neat -- but from a maintainability
perspective, it sounds like even more of a nightmare than complex regexes
usually are.
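The match-then-filter approach for ZIP codes described earlier in this comment could look roughly like this in Python. The set `KNOWN_ZIPS` is a tiny hypothetical stand-in for a real lookup table or validation service, and the function names are made up for the sketch.

```python
import re

# Five digits, optionally followed by a hyphen and four more (ZIP+4).
# re.ASCII keeps \d limited to 0-9.
ZIP_RE = re.compile(r"\b(\d{5})(-\d{4})?\b", re.ASCII)

# Hypothetical stand-in for a real list of valid ZIP codes.
KNOWN_ZIPS = {"94103", "10001"}

def find_zip_candidates(text):
    """Return every substring that is shaped like a ZIP or ZIP+4 code."""
    return [m.group(0) for m in ZIP_RE.finditer(text)]

def valid_zips(text):
    """Keep only the candidates whose five-digit part is a known ZIP."""
    return [z for z in find_zip_candidates(text) if z[:5] in KNOWN_ZIPS]

sample = "Ship to 94103-1234 or 10001, not 00000 or 1234567"
```

The regex stays short and readable because it only checks the shape; deciding which codes actually exist is left to the lookup step.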

~~~
ftarlao
The site has gone down twice, once because of a power outrage and once because
of the traffic peak... I think the problems are solved now. :-)

~~~
aaronem
> power outrage

I like that!

~~~
ftarlao
Ouch! My fault, it sounds similar but it is not the same :-)

------
chavo-b
Why does the resulting regex contain ++ instead of only one +? Such regexes
are wrong!

~~~
logn
"Possessive quantifiers are a way to prevent the regex engine from trying all
permutations. This is primarily useful for performance reasons. You can also
use possessive quantifiers to eliminate certain matches."

[http://www.regular-expressions.info/possessive.html](http://www.regular-expressions.info/possessive.html)
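A minimal Python sketch of the difference. Python's `re` accepts `\d++` directly only from version 3.11, so this example emulates the possessive behaviour with a lookahead plus a named backreference, which works on older versions too; the group name `run` is arbitrary.

```python
import re

text = "12345"

# Greedy: \d+ first grabs all five digits, then gives one back so "5" can match.
greedy = re.fullmatch(r"\d+5", text)

# Possessive-style: the lookahead captures the whole digit run, and the engine
# cannot backtrack into it, so nothing is left for the trailing "5".
possessive_like = re.fullmatch(r"(?=(?P<run>\d+))(?P=run)5", text)

greedy_matched = greedy is not None
possessive_matched = possessive_like is not None
```

The greedy pattern succeeds by backtracking and the possessive-style one fails outright. This shows both effects from the quote: fewer permutations to try, and some matches eliminated.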

~~~
chavo-b
ok, tnx, now it makes sense

------
quinta
crossposted here:
[https://news.ycombinator.com/item?id=4682545](https://news.ycombinator.com/item?id=4682545)

~~~
ExpiredLink
... 1006 days ago.

~~~
ftarlao
... to be fair, we have recently updated the engine and the web interface; the
new application is a big step forward: same URL, but a totally different
thing. The older tool is here:
[http://regex.inginf.units.it/extract/](http://regex.inginf.units.it/extract/)

