
Automatic Generation of Regular Expressions from Examples - ColinWright
http://regex.inginf.units.it/
======
bartolialberto
I am the "boss" of the research group that put the regex generator tool
online (<http://machinelearning.inginf.units.it/> "Data and tools"): we are all
VERY SORRY for any inconvenience we may have created.

We simply did not expect so much interest in our work. Yesterday we received
more than 11000 visits. We are not prepared to handle such traffic.

We are working toward adding a few more CPUs and will post status updates on
our Twitter account: <http://twitter.com/@MaleLabTs>.

We are afraid we will not be able to fully solve the overload problem anytime
soon, though. We are a small research group, with scarce resources.

We are also preparing a small FAQ addressing several of the interesting
points raised in this discussion (e.g., why not execute everything in
JavaScript?). We will announce completion via Twitter. Further comments and
criticisms are most welcome.

Thanks a lot for your interest and patience.

~~~
ColinWright
That's a useful response - thanks for taking the time to register and respond.

Apologies for the extreme load caused - when geeks find something cool they
tend to share it, and not everyone is ready for that level of interest.

I, for one, look forward to your updates. What would be really great would be
for you to prepare some kind of informal overview and put that up as a blog
post, making it clear that it's intended to be informal, and is not in any way
your formal research results or anything like that. Then post here a link to
it. If you engage with this community you'll get a lot of smart people
offering thoughts and suggestions from which you can pick and choose.

Great work - thank you.

------
a3_nm
Interestingly, there is much to say about the theoretical problem of finding
the smallest regular expression for a set of positive and negative examples. A
nice overview: <http://cstheory.blogoverflow.com/2011/08/on-learning-regular-languages/>

------
tlrobinson
This is cool, but it doesn't look like it's doing even trivial
optimizations...

    
    
        (?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))
    

"[\d\d\.\d\.]" is equivalent to "[\d\.]", right?

It's also got a useless non-capturing group and lookahead.

~~~
merlincorey
You have apparently useless genes in your genetic code as well.

These regexes are generated with genetic programming techniques, so they are
not guaranteed to be an optimal solution, just a good (by the metric of
working) one.

~~~
tlrobinson
You're right, I was just pointing out that with some basic optimizations you
could get a much cleaner result.

Applying some fairly simple optimizations to my example result I get the
following equivalent regular expression:

    
    
       \d{2}[\d\.]{3,}
    

which isn't too bad.

EDIT: Here are the optimizations:

The original:

    
    
        (?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))
    

1. Remove non-capturing groups and lookahead:

    
    
        \d\d[\d\d\.\d\.][\d\.]([\d\.]+)\1
    

2. Simplify set:

    
    
        \d\d[\d\.][\d\.]([\d\.]+)\1
    

3. Remove redundant backreference:

    
    
        \d\d[\d\.][\d\.][\d\.]+
    

4. Combine duplicates:

    
    
        \d{2}[\d\.]{3,}
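
One quick way to check that step 4 really is equivalent to the original is to compare search behavior on sample strings (a Python sketch; the samples are made up, so this is a spot check, not a proof):

```python
import re

original = r"(?:\d\d[\d\d\.\d\.][\d\.](?:(?=([\d\.]+))\1))"
simplified = r"\d{2}[\d\.]{3,}"

# Compare search results on a handful of strings; both should agree on
# whether a match exists and on the span of the match.
samples = ["12.34.5", "192.168.0.1", "12345", "1.2", "12.x", "ab12.3", ""]
for s in samples:
    a = re.search(original, s)
    b = re.search(simplified, s)
    assert (a is None) == (b is None), s
    if a is not None:
        assert a.span() == b.span(), s
```

The lookahead-plus-backreference trick `(?=([\d\.]+))\1` behaves like a possessive `[\d\.]+`, but since nothing follows it in the pattern, dropping it doesn't change what matches.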

~~~
mdda
The optimizations are great to apply to an end result, but the Genetic
Programming process probably won't benefit from the expressions being
optimized 'along the way' because some of the 'junk' subexpressions allow for
more robustness in the face of the genetic operators.

Putting it another way, your highly optimized version (4) is likely to be much
more brittle to small 'defects' than (1).

~~~
DannyBee
The last point can't be overstated. Optimization is highly likely to overfit
the training data.

For example, given training data 000001 000002 000003 000004

The "optimized" version would be \d+[1-4]

If the goal was simply to find the smallest regex that matches all the
training data, that's trivial to code, NP-hard to execute.

But the goal isn't to find the smallest regexp, it's to find a "good" regexp.
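
The overfitting is easy to demonstrate (a Python sketch; the "unseen" strings are invented for the sake of the example):

```python
import re

training = ["000001", "000002", "000003", "000004"]
overfit = r"\d+[1-4]"  # minimal pattern covering exactly the training data

# The minimal pattern matches every training example...
assert all(re.fullmatch(overfit, s) for s in training)

# ...but rejects perfectly plausible unseen values from the same source.
assert re.fullmatch(overfit, "000005") is None
assert re.fullmatch(overfit, "000010") is None
```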

~~~
tlrobinson
My "optimizations" were trivial, in that they don't change the behavior of the
regex at all, they just simplify it. As khafra points out, they are
equivalent.

I also wasn't trying to say those simplifications should be applied between
iterations of the algorithm, only at the end.

~~~
mdda
Understood - and I was just making the observation that the expressions
produced during the process are junky in a way that's characteristic of how
they evolve: crossing two tightly honed expressions would produce much more
extreme deviations in the 'children' than crossing two of these hairy
expressions.

Not only are the expressions tending to get more accurate with each
generation, but their degree of 'evolvability' is also being selected for.

------
curiousdannii
"When there are less than 200 examples, the training set and validation set
have the same size. Otherwise, the training set is composed of 100 examples
and the other examples constitute the validation set."

Why divide the data into sets like that? Assuming the sample data is
representative, wouldn't it be better to use much more for training, say 90%
with 10% for validation?

~~~
haffi112
Then you might have the problem of over-fitting the training set.

See here for example: <http://en.wikipedia.org/wiki/Overfitting>

~~~
StavrosK
You use the validation set precisely to avoid overfitting. I agree that a
90/10 split would be more reasonable.

~~~
ZoFreX
You could also use 10-fold cross validation[1] or similar where you rotate the
training/validation sets.

[1]: <http://en.wikipedia.org/wiki/Cross-validation_%28statistics%29#K-fold_cross-validation>
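
The rotation is easy to sketch in plain Python (a minimal illustration, not any particular library's API):

```python
# Minimal k-fold cross-validation sketch: each fold takes a turn as the
# validation set while the remaining folds form the training set.
def k_fold(examples, k=10):
    folds = [examples[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation

examples = list(range(20))
splits = list(k_fold(examples))
assert len(splits) == 10
for training, validation in splits:
    assert len(training) == 18 and len(validation) == 2
    assert sorted(training + validation) == examples
```

Every example gets used for validation exactly once, which makes better use of a small example set than a single fixed split.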

~~~
icelancer
Also natively supported in WEKA. (Not that WEKA is easy to use.)

------
draegtun
For the reverse of this see HN discussion _Given a list of regexes, generate
all possible strings that matches all of them_ -
<http://news.ycombinator.com/item?id=2579783>

------
RyanMcGreal
I can see how this supports TDD. Write a test suite and it will generate a
regex that passes it.

~~~
ajanuary
I can't find it now, but I've read about projects generating haskell code from
unit tests. Anyone got links?

~~~
egonschiele
Do you mean QuickCheck?
<http://www.haskell.org/haskellwiki/Introduction_to_QuickCheck>

~~~
Evbn
QC generates random examples from a rule, not the reverse.

------
jcromartie
Slightly off-topic: I was surprised to see a ".it" domain that was actually
Italian.

~~~
ExpiredLink
More off-topic: I wonder why 'interesting' top-level domains like '.it' or
'.at' aren't used more frequently to create 'interesting' domain names.

Edit: you.get.it

~~~
dwwoelfel
They're hard to register and cost more. I registered <http://canhekick.it>
through iwantmyname.com after someone snatched up canhekickit.com. It cost me
$30 instead of $10.

------
superasn
Sounds neat, but unfortunately it isn't working with a "Server is busy!"
error.

~~~
songgao
I wonder why the GP has to run on the server side, rather than in
JavaScript in the browser.

~~~
mistercow
It certainly does not _have_ to be running on the server side. I'm guessing
that it's just a matter of what environment the creator was comfortable with.

------
jmmcd
Sumit Gulwani of Microsoft Research has a non-genetic programming approach to
this and similar problems, which is interesting to compare:
<http://dl.acm.org/citation.cfm?doid=2240236.2240260>

------
csmatt
I attempted writing something similar when I was bored at work a year ago. I
soon discovered how difficult this problem actually is. It's definitely beyond
afternoon hacking. I'd love to get a peek at the secret sauce!

~~~
a3_nm
There is a paper which explains how it works:
<http://mllab.inginf.units.it/publications/international-conference-publications/automaticgenerationofregularexpressionsfromexampleswithgeneticprogramming>

Note that there are other (non-genetic) approaches to learn regular
expressions (some are cited in the above paper).

~~~
csmatt
Thanks for the link. I've been meaning to get into machine learning and this
sounds like a good, tractable problem to start with.

------
arekp
This makes me think: in 2012, people still believe that genetic programming, a
numerical optimization technique (and not the most efficient one), magically
solves arbitrary machine learning problems.

~~~
drblast
More efficient techniques, such as?

It would be interesting for me and others here to learn from what you know
rather than be left with what you wrote.

------
pathikrit
Funnily enough, I asked this question a while back on Stack Overflow:
<http://stackoverflow.com/questions/4880402/>

------
coroxout
I look forward to trying this once the HN server overload effect has worn off.

This is something I would have found useful a year or two ago; I considered
writing it myself, but decided it would be impossible to get a usefully
generalised regex. I'm glad people cleverer and less defeatist than me had the
same idea, since it seems it is possible to get useful results after all :)

------
vineet
There is a related cool system: <http://groups.csail.mit.edu/uid/lapis/>

It is actually a text editor, and doesn't come up with Regular Expressions but
similar textual constraint patterns.

------
josephagoss
Is there any chance to look at the code (application)? I'd be interested in
running this locally as I have a need for generation of regular expressions.

------
reader5000
Instead of randomized hill-climbing (genetic programming's search algo) why
not try a little A*?

~~~
sunir
How would you envision that working? What is the estimated distance function?

~~~
DasIch
The "distance" of a regular expression could be its specificity (think CSS) or
the size of the language described by the regular expression (with some
adjustment to account for repetition, since a* describing an infinite language
is probably not what is intended).

The real problem is that the space of possible solutions is very large, so you
will have to do that lazily, which would make an implementation in a lot of
languages rather annoying.

I'd also expect this algorithm to be biased towards regular expressions that
are more specific than the user intended them to be, especially for example
sets that are small or not very diverse.
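
A toy version of that language-size heuristic might look like this (the AST encoding is invented purely for illustration; repetition is capped so a* stays finite):

```python
# Estimate how many strings a tiny regex AST can match, capping
# repetition (here at 3) so a* doesn't make every estimate infinite.
REP_CAP = 3

def language_size(node):
    kind = node[0]
    if kind == "lit":      # a single literal character
        return 1
    if kind == "class":    # character class; node[1] = number of choices
        return node[1]
    if kind == "seq":      # concatenation multiplies the counts
        total = 1
        for child in node[1]:
            total *= language_size(child)
        return total
    if kind == "star":     # a* capped at REP_CAP repetitions
        base = language_size(node[1])
        return sum(base ** n for n in range(REP_CAP + 1))
    raise ValueError(kind)

# \d\d -> 10 * 10 = 100 strings
assert language_size(("seq", [("class", 10), ("class", 10)])) == 100
# a* capped at 3 repetitions -> "", "a", "aa", "aaa" -> 4 strings
assert language_size(("star", ("lit", "a"))) == 4
```

A smaller estimated language means a more specific expression, which could serve as the "cost so far" in a best-first search; designing an admissible estimate of remaining distance is the harder part.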

~~~
__alexs
Is a* more or less infinite than .*?

~~~
krallja
<http://en.wikipedia.org/wiki/A*_search_algorithm>

~~~
__alexs
<http://en.wikipedia.org/wiki/Aleph_number>

------
wib
Cool & handy

------
tlarkworthy
The sledge hammer approach.

------
enraged_camel
I keep getting "server busy". Why share a link if it won't be able to handle
the traffic?

~~~
ColinWright
So, every time you share a link, you always test it first to see if it can
handle the load?

Ah, no, looking at your profile you haven't yet ever shared a link. I can only
conclude that you find lots and lots of things that you are eager to share
with us, but when you rigorously test them you find that they can't handle the
load, so you sadly decline to submit them.

Thanks for the feedback. I'll be sure to keep it in mind.

~~~
enraged_camel
This is a community for geeks. I believe it is a realistic expectation that
one should be wary of sharing sites that run processor-intensive server-side
scripts with tens of thousands of readers.

(Nice ad hominem, by the way.)

~~~
ColinWright
This is a community for geeks ( although there are those who claim otherwise:
<http://news.ycombinator.com/item?id=4681146> ). I believe it is a reasonable
expectation that people will want to see things and will understand that
sometimes they will need to spread the load. Cutting edge technology sometimes
can't handle the load, and geeks will know that, understand, and make
allowances. I would have thought that people in this community would want to
see new stuff, rather than have it withheld simply because someone thought the
server might not be able to handle the load.

How can one tell if the server is going to be able to handle the load? Perhaps
the owners had anticipated that this would be of interest and were ready to
handle it?

How could I tell?

I think it's a reasonable expectation in a community of geeks that if you are
going to level criticism at me, you should offer alternatives and constructive
suggestions. I look forward to your reply as to what one should do when
considering submitting a link like this.

How would _you_ test it?

~~~
enraged_camel
I didn't give suggestions on how to test it because I figured someone who
knows how to automatically generate regular expressions from examples would
also know how to test it under load (or that it _should_ be tested before
being made public). It should not be that difficult to simulate simple user
behavior, especially since the majority of the functionality in this case is
tied to (presumably) one method in the code. Run it 100/1,000/10,000 times
simultaneously behind the scenes with a bunch of test examples and see if your
server vomits. Pretty easy.

Honestly, I'm a bit dismayed that you are challenging me on this. "Load-test
your code before releasing it" should be a strict requirement for anything
that is submitted here. Otherwise, how are people expected to give feedback?
Why is our time being wasted on stuff that does not work?

~~~
ColinWright
Hmm.

You do realize, do you not, that it's not my code. Or my server. I'm just
sharing something I found.

