

Show HN: my regular expression example match generator - rndmcnlly0
http://regexio.com/

======
rndmcnlly0
I made this tool over a year ago as a proof of concept. It doesn't support all
features (like negations inside of character classes), but it should
demonstrate the benefit that a tool like this would give you. I'd be
interested in helping someone else understand the guts and make it into a
public service to help out newbie developers coming to understand these
notoriously-difficult-but-highly-useful expressions.

------
elblanco
Oh god, I've been wanting something like this for ages. I've always dreamed of
a regex driven list generator, but where I can supply constraints on
repetition operators (*->{0,2}, +->{1,3}) to limit the scope, but then get a
comprehensive list of all the matching strings.

Imagine using some rules to take something like a user driven entry, where
they'll enter something like say...a phone number freetext, but then use some
rules that turn that phone number into a regex that'll match many common ways
to write that phone number, then generate all the possible matching versions
of that phone number for submission against an indexed database of text
documents to see if any match. It's a trivial example, but a similar concept
could work for things like names that have been Romanized from a non-Latin
alphabet....generate a regex to match some variants, then generate the
variants and search an indexed database for those variants. It would be much
much faster in many cases than scanning through all the documents with the
regex.
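
For illustration, once every repetition operator is capped, the matching set is finite and can be listed exhaustively. The following is only a sketch of that idea: the combinator names (`lit`, `alt`, `seq`, `rep`) and the phone-number fragments are made up for the example and are not part of regexio.

```python
from itertools import product

def lit(s):
    """Language containing exactly one string."""
    return [s]

def alt(*langs):
    """Union of finite languages, keeping first-seen order, no duplicates."""
    seen = []
    for lang in langs:
        for s in lang:
            if s not in seen:
                seen.append(s)
    return seen

def seq(*langs):
    """Concatenation: one string picked from each language, in order."""
    return alt(["".join(parts) for parts in product(*langs)])

def rep(lang, lo, hi):
    """Bounded repetition {lo,hi}; what * or + becomes once it is capped."""
    return alt(*[seq(*[lang] * n) for n in range(lo, hi + 1)])

# Hypothetical rule: the separator in a phone number may be "-", " ", or absent.
separator = alt(lit("-"), lit(" "), lit(""))
variants = seq(lit("555"), separator, lit("1234"))
print(variants)           # every spelling to look up in the index
print(rep(lit("a"), 0, 2))  # a* capped to a{0,2}
```

Each string in `variants` could then be submitted as an exact lookup against the index instead of scanning every document with the regex.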

~~~
elblanco
Nearly forgot this similar effort:
<http://research.microsoft.com/en-us/projects/rex/>

------
nuclear_eclipse
Congrats on supporting back-references :)

Hint: Try using "1(0+)1\1" or "([a-zA-Z]+)\1" for some nifty examples of
things that technically _aren't possible_ with true regular expressions, but
_are_ supported by enhanced regex engines like Perl/PCRE.
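
Those two patterns can also be tried outside the tool with Python's standard `re` module, which supports back-references. This is a minimal sketch; the helper name `is_full_match` and the sample strings are just for illustration.

```python
import re

def is_full_match(pattern, text):
    """True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# \1 must repeat exactly what group 1 captured, so the engine has to
# remember an arbitrarily long substring, which a finite automaton cannot do.
print(is_full_match(r"1(0+)1\1", "100100"))        # same run of zeros twice
print(is_full_match(r"1(0+)1\1", "10100"))         # runs differ in length
print(is_full_match(r"([a-zA-Z]+)\1", "abcabc"))   # a word repeated
print(is_full_match(r"([a-zA-Z]+)\1", "abcabd"))   # second half differs
```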

Edit: too bad it only works with one back reference, e.g., adding \2 apparently
breaks everything, returning no matches and no code.

Edit2: after more playing, it seems that using \3 works as a backref for the
second matching group... try "(a+)(b+)\1\3"...

------
pjscott
It doesn't seem to be able to handle character classes of the type [^abc],
which matches any character _except_ a, b, or c. It's pretty cool, though. I
tried copying and pasting some crazy regexps from the internet in there, and
it immediately gave me some idea what they do. Well done!

~~~
rndmcnlly0
I purposely punted on inverted character classes because they brought up a
major usability issue that I didn't have a quick solution for. The expression
/[^abc]/ should match '@', ' ', and 'q', but, by the numbers, the majority of
generated matches would look like '{angry unicode that you can't even see}'.
This brings up the idea that you don't just want valid matches, but a
meaningful set of examples that covers different ways of matching.

Likewise, I imagine it might be useful to specify some background language
that matches should be taken from. That is, you might want to tell it to show
you example matches made from snippets of valid html, from email-style English
text, ascii-only strings, arbitrary unicode, etc. Incorporating this kind of
option requires a lot of original thinking about usability (and some prototyping to
figure out what would even make sense).
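
One very small version of a background language is a restricted alphabet: draw examples for a negated class only from characters a reader can actually see. The sketch below assumes that simplification; `VISIBLE` and `negated_class_examples` are hypothetical names, not anything from the regexio source.

```python
import random
import string

# Assumed "background" alphabet: visible ASCII plus the space character.
VISIBLE = string.ascii_letters + string.digits + string.punctuation + " "

def negated_class_examples(excluded, count, alphabet=VISIBLE, seed=None):
    """Sample characters matching [^excluded], limited to `alphabet`."""
    rng = random.Random(seed)
    candidates = [c for c in alphabet if c not in excluded]
    if not candidates:
        raise ValueError("alphabet leaves nothing to choose from")
    return [rng.choice(candidates) for _ in range(count)]

# Examples for /[^abc]/ that stay readable.
print(negated_class_examples("abc", 10, seed=1))
```

Swapping in a different `alphabet` changes the flavour of the examples without touching the expression itself.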

~~~
abecedarius
Since the intersection of two regular languages is a regular language, you
could reduce generating 'background language' example matches to taking
intersections. It's unfortunate how rare it is for libraries to provide you
that operation.

(Same thing for HTML: intersecting a regular language with a context-free one
is context-free.)
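
For the regular-with-regular case, the intersection can be built with the textbook product construction: run both automata in lockstep and accept only when both accept. The following is a sketch under simplifying assumptions (hand-written DFAs as dictionaries, a three-symbol alphabet); the `DFA` class and the two example machines are invented for illustration.

```python
from itertools import product

class DFA:
    def __init__(self, start, accepting, delta):
        self.start = start
        self.accepting = set(accepting)
        self.delta = delta  # (state, symbol) -> state; a missing key rejects

    def accepts(self, text):
        state = self.start
        for ch in text:
            if (state, ch) not in self.delta:
                return False
            state = self.delta[(state, ch)]
        return state in self.accepting

def intersect(a, b):
    """Product construction: states are pairs, both halves must accept."""
    delta = {}
    for (sa, ch_a), ta in a.delta.items():
        for (sb, ch_b), tb in b.delta.items():
            if ch_a == ch_b:
                delta[((sa, sb), ch_a)] = (ta, tb)
    accepting = set(product(a.accepting, b.accepting))
    return DFA((a.start, b.start), accepting, delta)

# Pattern: any non-empty string over the toy alphabet {a, b, @}.
pattern = DFA(0, {1}, {(s, ch): 1 for s in (0, 1) for ch in "ab@"})
# Background language: strings made only of the letters a and b.
background = DFA(0, {0}, {(0, "a"): 0, (0, "b"): 0})

both = intersect(pattern, background)
print(both.accepts("ab"), both.accepts("a@"), both.accepts(""))
```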

------
jimbokun
Bookmarked. I had an immediate use for this, and your application filled it
beautifully.

------
JangoSteve
The first thing I tried to do was

    
    
      [\d]+
    

But it didn't work. Took me a few tries to get it to work. I'm guessing it
simply doesn't work with special regex characters.

You know what would be even more useful for me? I give you a string and
highlight the important part I need to match, and you give me a good regex for
it.

Example: I need to match the string "View conversation (5)", with the
important part being that there is a number near the end of the string. So I
type that and highlight the "5". Then you give me something like:

    
    
      /[^\d]+\d+[^\d]*/
    

Or whatever. Obviously it won't be foolproof because you don't know all my
use cases and edge cases, but it'd give me somewhere to start.
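
As a sanity check, the suggested expression can be run against the example string with Python's `re` module. In this sketch the capture group around `\d+` is an addition (so the highlighted number can be pulled out), and `extract_count` is a hypothetical helper name.

```python
import re

# Same shape as the expression above, with a capture group added
# around the digits so the "important part" can be retrieved.
PATTERN = re.compile(r"[^\d]+(\d+)[^\d]*")

def extract_count(text):
    """Return the number embedded in `text`, or None if it doesn't match."""
    match = PATTERN.fullmatch(text)
    return match.group(1) if match else None

print(extract_count("View conversation (5)"))
print(extract_count("View conversation (12)"))
print(extract_count("5 messages"))  # no leading non-digit, so no match
```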

~~~
jsankey
My CS honours thesis involved deriving regexes (well, DTDs for XML documents,
but it is essentially the same problem).

The quality of the results depends a lot on the amount and quality of the
input (how well it represents what you are really trying to match). A single
example string is unlikely to get you far - you would certainly need multiple
matching examples. So I think this approach works best when you already have
an example corpus to work from, rather than providing input manually. If
you're going to spend effort providing a lot of input, then you'd probably be
better off spending at least some of that effort in providing hints or
possible regex answers.

Further, there are many possible regexes that would match an input set, so the
algorithm also needs some way to evaluate them and choose the better
candidates. In my case I used some ideas from information theory (such as
MML), which actually worked reasonably well. But this is a computationally
hard problem, so even with an objective measure of the optimal regex you won't
necessarily be able to find it.

------
kenjackson
In grad school I had a need for a tool that would generate all strings
accepted by a regex of a specified size.

I wrote one myself, but was never really satisfied that I did a good job on
it. I'd love to see the code of an efficient implementation of such a thing.
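
For comparison, here is the naive baseline, not an efficient implementation: try every string of the requested length over a given alphabet and keep the ones the regex accepts. It is exponential in the length, so it is mainly useful for checking a smarter enumerator on small cases. The function name and the example pattern are illustrative.

```python
import re
from itertools import product

def matches_of_length(pattern, alphabet, length):
    """Brute force: all strings of exactly `length` symbols matching `pattern`."""
    compiled = re.compile(pattern)
    results = []
    for symbols in product(alphabet, repeat=length):
        candidate = "".join(symbols)
        if compiled.fullmatch(candidate):
            results.append(candidate)
    return results

print(matches_of_length(r"a+b*", "ab", 3))
```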

~~~
abecedarius
Doug McIlroy wrote an article on that:
<http://www.cs.dartmouth.edu/~doug/nfa.ps.gz>

Code: <http://www.cs.dartmouth.edu/~doug/nfa.hs>

~~~
kenjackson
Thank you. That is actually an excellent paper. A decade late for me, but
excellent.

------
jusob
You mean you re-implemented String::Random?
<http://search.cpan.org/~steve/String-Random-0.22/lib/String/Random.pm>

~~~
rndmcnlly0
Interesting module, but different target application. Part of the core of my
generation process is remembering which parts of the example match came from
which piece of your original expression so that (someday) there could be some
nice mouseover effects that help you understand HOW some example match works
(as there are sometimes expressions that match a given string in multiple
ways).

------
rndmcnlly0
Here is the source to the python meat of this project:
<http://pastie.org/1180230>

The general idea is that it is a command line tool that reads a regex as input
(along with a number of examples to generate). It parses the regex into an
abstract syntax tree, then hands it over to an interpreter to "execute" the
expression/program several times.

The 'simpleparse' python library does a nice job of easing the regex-to-tree
mapping.
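
To make the parse-then-interpret idea concrete without reproducing the linked source, here is a rough sketch. It skips parsing entirely and starts from a hand-built tree; the node shapes, the `generate` function, and the `max_repeat` cap are all assumptions made for this example and do not reflect how the pastie code or 'simpleparse' is actually organized.

```python
import random
import re

# Hypothetical AST for the regex ab+(c|[0-9]); node shapes are illustrative.
AST = ("seq", [
    ("lit", "a"),
    ("plus", ("lit", "b")),
    ("alt", [("lit", "c"), ("class", "0123456789")]),
])

def generate(node, rng, max_repeat=3):
    """Walk the tree, making a random choice at every branch point."""
    kind, body = node
    if kind == "lit":
        return body
    if kind == "class":
        return rng.choice(body)
    if kind == "seq":
        return "".join(generate(child, rng, max_repeat) for child in body)
    if kind == "alt":
        return generate(rng.choice(body), rng, max_repeat)
    if kind in ("star", "plus"):
        low = 0 if kind == "star" else 1
        count = rng.randint(low, max_repeat)  # cap unbounded repetition
        return "".join(generate(body, rng, max_repeat) for _ in range(count))
    raise ValueError(f"unknown node kind: {kind}")

def examples(node, how_many, seed=0):
    """'Execute' the expression several times to get several matches."""
    rng = random.Random(seed)
    return [generate(node, rng) for _ in range(how_many)]

for sample in examples(AST, 5):
    # Cross-check each generated string against a real matcher.
    assert re.fullmatch(r"ab+(c|[0-9])", sample)
    print(sample)
```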

------
xtacy
Reminded of this: <http://news.ycombinator.com/item?id=1387418>. Perhaps it's
too complicated? :-)

------
j_baker
One thing that would be useful is if I could select different regex engines.
Python does things _slightly_ differently from Perl, C#, and (especially)
emacs.

~~~
rndmcnlly0
Heh, it's a matter of implementing a declarative model of each of the different
regex engines you'd like to select. It would be fantastic if it existed, wanna
try?

------
gus_massa
I think that it should show for example 10 matches (not only one), so it is
easier to have an idea of "all" the possible matches.

------
DTrejo
I've found <http://txt2re.com/> to be very helpful for simple things.

------
sharpemt
Something similar to check out. It's based on JavaScript and lets you visually
verify basic patterns on text blocks: <http://regexpal.com/>

------
nerfhammer
Using python with the re.X flag?

I can tell because it doesn't understand scoped flags or atomic groupings.

~~~
rndmcnlly0
While the generation engine is written in python (source linked in another
comment), the engine itself doesn't make use of existing regex libraries
because they solve a different problem (deterministic matching vs
nondeterministic generation).

------
swaits
Similar: <http://rubular.com/>

~~~
rndmcnlly0
It looks like rubular just shows you example matches out of text you've already
typed in (the same situation you have when playing with any regex matching
library directly), whereas regexio generates matches you might not expect out
of the blue. Can you clarify what sense of similarity you mean to point out?

~~~
swaits
Yes, I realize that rubular doesn't generate the matches. But they are both
web sites where you primarily enter a regex in order to explore its behavior.
The similarity is obvious.

------
frobozz
What an exceptionally helpful tool! Thanks.

