
DeepRegex: Neural Generation of Regular Expressions from Natural Language - nicklo
http://arxiv.org/abs/1608.03000
======
nicklo
Hey HN! One of the authors here. This was a fun project to work on. Happy to
answer any questions :)

Code + data here: [https://github.com/nicholaslocascio/deep-regex](https://github.com/nicholaslocascio/deep-regex)

~~~
jgalt212
Are you familiar with the work of the Machine Learning Lab at the University
of Trieste? And, if so, could you quickly comment on how the two approaches
differ?

[http://machinelearning.inginf.units.it/](http://machinelearning.inginf.units.it/)

[http://regex.inginf.units.it/](http://regex.inginf.units.it/)

~~~
nicklo
Yeah, their work is very cool! The two tasks are a bit different, however. Their
system is for the task of generating regular expressions given a set of
positive and negative examples of what the regex should match. It uses genetic
algorithms and other techniques to optimize and search for a regex that fits
all the given examples.

In our case, we have no examples to test against, only a natural language
(English) description of what the user wants the regex to do. This is more of
an inference problem than a search problem: we get one shot at our best guess,
with no tests to check the answer against and refine it.
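
To make the example-driven setting concrete, here is a minimal sketch of scoring candidate regexes against positive and negative examples. It is only a toy under stated assumptions: it is not the Trieste system's genetic algorithm, and the names `score`, `positives`, `negatives`, and `candidates` are hypothetical, chosen for illustration.

```python
import re

def score(pattern, positives, negatives):
    """Fraction of examples that a candidate pattern classifies correctly."""
    try:
        compiled = re.compile(pattern)
    except re.error:
        return 0.0  # a candidate that does not compile gets the worst score
    hits = sum(1 for s in positives if compiled.fullmatch(s))
    rejects = sum(1 for s in negatives if not compiled.fullmatch(s))
    return (hits + rejects) / (len(positives) + len(negatives))

# Hypothetical examples: strings that should and should not match.
positives = ["2016-08-09", "1999-12-31"]
negatives = ["2016/08/09", "hello", "16-8-9"]

# A fixed candidate pool stands in for whatever a real search would propose.
candidates = [r"\d+", r"\d{4}-\d{2}-\d{2}", r".*-.*"]
best = max(candidates, key=lambda p: score(p, positives, negatives))
print(best)
```

A search-based system can call something like `score` repeatedly and keep improving its candidates. With only an English description there is no such feedback signal, which is the difference described above.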

~~~
jgalt212
Cool, thanks for the explanation.

------
RankingMember
As soon as I saw the title I started thinking of how great it would be to have
a solid NaturalLanguage -> Regex translator, something I've always wanted.
Regex is so powerful, but sometimes it takes so long to get it to do what you
want it to.

~~~
flanbiscuit
> how great it would be to have a solid NaturalLanguage -> Regex translator

This idea sounds good, but as soon as you start getting slightly more
complicated you'll be writing paragraphs:

Try writing this in a natural language format:

    <a\s+(?:[^>]*?\s+)?href="([^"]*)"

That's a regex to get the value of an href from anchor links.

"match "<a " then do not match a ">" if it exists, followed by a space, if it
exists, then match a "href=", then begin a capturing group, then match
anything but a '"' 0 or more times, then close a capturing group, then match a
'"'

I'm sure that's not even correct, but you can see what I mean. I can see this
idea being a good tool for learning though, especially for smaller regexes.
~~~
modeless
> Try writing this in a natural language format:

> [...] That's a regex to get the value of an href from anchor links.

A true natural language to regex system would take your short natural language
description as input, not paragraphs describing the task in more detail. Of
course this would require a lot of domain knowledge about HTML, but that
knowledge is readily available out there on the internet. I think it's no
longer crazy to imagine a system which could read the internet, learn about
HTML, and apply that knowledge to answer your natural language regex query.

This is clearly far beyond where we are today, but I think neural nets a few
orders of magnitude larger would be able to handle this task, and the hardware
guys are hard at work getting us there. The pace of improvement will be much
faster than Moore's law over the next couple of years as the first optimized
neural net hardware becomes available.

------
duaneb
The applications of this for scraping alone are amazing.

~~~
staticautomatic
Imagine translating natural language into XPath queries. Now that would be
something.

