

frak: Transform collections of strings into regular expressions - jamesjyu
https://github.com/noprompt/frak

======
djvv
Here's the paper that shows how to do the direct construction of the minimal
DFA in linear time:
[http://acl.ldc.upenn.edu/J/J00/J00-1002.pdf](http://acl.ldc.upenn.edu/J/J00/J00-1002.pdf)
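
The paper builds the minimal acyclic DFA incrementally from sorted input. A simpler offline variant reaches the same result: build a full trie, then merge states with identical right languages bottom-up. A minimal Python sketch (function names are mine, and this is not the paper's one-pass algorithm):

```python
def minimal_dfa(words):
    """Build a trie, then merge states with equal right languages
    bottom-up, giving the minimal acyclic DFA for the word set."""
    trie = {}
    for w in words:
        node = trie
        for ch in w:
            node = node.setdefault(ch, {})
        node[True] = True  # accepting marker

    register = {}  # state signature -> canonical state id
    states = []    # id -> (is_final, sorted transition tuple)

    def canonical(node):
        trans = tuple(sorted((ch, canonical(child))
                             for ch, child in node.items() if ch is not True))
        sig = (True in node, trans)
        if sig not in register:
            register[sig] = len(states)
            states.append(sig)
        return register[sig]

    return canonical(trie), states

def accepts(dfa, word):
    start, states = dfa
    state = start
    for ch in word:
        _, trans = states[state]
        nxt = dict(trans).get(ch)
        if nxt is None:
            return False
        state = nxt
    return states[state][0]
```

For ["tap", "taps", "top", "tops"] the trie has 8 states; the merged DFA has 5, since the "a"/"o" branches share their entire suffix structure.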

~~~
danieldk
And my implementation in Java :):

[https://github.com/danieldk/dictomaton](https://github.com/danieldk/dictomaton)

------
jonstewart
It may well take longer to parse the complex trie regex (with its character
classes, ?, etc.) constructed here than it takes a good regexp library to
parse the plain alternation of fixed strings and construct the equivalent
automaton.

~~~
roryokane
I don’t know how common such “good regexp libraries” are. The README includes
a benchmark at the bottom, demonstrating that the complex regex is about ten
times faster than the simple version. The benchmark uses Clojure’s built-in
regex engine, which is the same as Java’s.

~~~
FreakLegion
Well, PCRE, the most widely used regex lib in my experience, has a "DFA" mode
(scare quotes because it's actually an _NFA_ mode; see [1]). So the option is
there, even if most people use PCRE in standard recursive kablooey mode.

I was working on a drop-in managed wrapper around re2[2] so .NET developers
would have access to a good regex library, but damned if starting a company
hasn't gotten in the way of finishing it. Still, re2 bindings are available
for most other common platforms.

(Also I don't believe Java's regex engine is automata-based, which would
explain the performance gains.)

1\. [http://fanf.livejournal.com/37166.html](http://fanf.livejournal.com/37166.html)

2\. [http://code.google.com/p/re2/](http://code.google.com/p/re2/)

------
petercooper
Humorous timing for me as I made a (bad) joke the other day about such a piece
of software: _"I created an algorithm to turn a selection of strings into a
regular expression but it just keeps returning (dot asterisk)"_

In other news, how _do_ you actually render an asterisk on HN?

~~~
neeee
* (asterisk followed by whitespace)

------
mkching
To those wondering if this is worth it, the method of using a trie-style
regexp to optimize this type of word matching has been used in the Perl
community for a while now and it does work well.

Starting with v5.10, if Perl encounters a regexp that is just a list of
strings, it will even use trie-based matching automatically.

Before v5.10, people used modules such as Regexp::Trie to build optimized
regexps, and they saw performance increases similar to the benchmark shown in
the post.
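
The trie-to-regexp trick behind Regexp::Trie fits in a few lines. A hypothetical Python analogue (not the module's actual code): build a nested-dict trie, then emit shared prefixes once, with `(?:...)` alternation for branches and `?` where a word ends early.

```python
import re

def build_trie(words):
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node[""] = True  # end-of-word marker
    return root

def emit(node):
    """Emit a regex fragment matching exactly the words under `node`."""
    end = "" in node
    alts = [re.escape(ch) + emit(child)
            for ch, child in node.items() if ch != ""]
    if not alts:
        return ""
    body = alts[0] if len(alts) == 1 else "(?:" + "|".join(alts) + ")"
    if not end:
        return body
    # A word ending here makes the remainder optional; group it
    # unless '?' already binds to the whole body.
    if len(alts) > 1 or len(body) == 1:
        return body + "?"
    return "(?:" + body + ")?"
```

For example, `emit(build_trie(["tap", "taps", "bat", "bit"]))` yields `(?:taps?|b(?:at|it))`, which fullmatches exactly those four words.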

------
dvdkhlng
This functionality has been available in Emacs (ELisp) for a long time:

[http://www.emacswiki.org/emacs/RegexpOpt](http://www.emacswiki.org/emacs/RegexpOpt)

~~~
agumonkey
Here goes my frak.el port attempt

------
fdigger
Here are some links to papers that show how to generate regular expressions
from a set of strings:

\- [http://doi.acm.org/10.1145/1735886.1735890](http://doi.acm.org/10.1145/1735886.1735890)

\- [http://doi.acm.org/10.1145/1841909.1841911](http://doi.acm.org/10.1145/1841909.1841911)

This is research done in the context of XML schema learning. The first paper
restricts itself to inference of a more limited class: single-occurrence
regular expressions, in which every alphabet symbol (letter) can occur at most
once. When enough strings are present, the algorithms are sound and complete.
The second paper drops this requirement, but its algorithms are mere
heuristics.

Learning regular expressions from strings is a fascinating but non-trivial
problem.

If you do not have access to the Digital Library of ACM, scholar.google.com is
your friend.

------
colanderman
Why not just use, say, an actual (i.e. backtrack-free, DFA-based) regular
expression library?

~~~
limmeau
This produces inputs to an arbitrary regex library. You could of course build
a regex "(w1|w2|w3|...|wN)" for words w1...wN (assuming an efficient regex
library) and save the hassle, but the frak output will hopefully be more
compact.
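
For concreteness, here is that naive alternation fallback in Python next to a hand-compacted trie-style equivalent (the word list is made up for illustration):

```python
import re

# A made-up word list for illustration.
words = ["bat", "bats", "bit", "bits", "but", "buts"]

# The naive fallback: alternate the escaped fixed strings.
plain = "(?:" + "|".join(map(re.escape, words)) + ")"

# What a frak-style trie construction collapses this to.
compact = "b[aiu]ts?"
```

Both patterns fullmatch exactly the same six words, but `compact` is 9 characters against 30 for `plain`.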

------
Mgccl
Google Analytics has a filter system that allows one to use regular
expressions of length at most 255.

Sometimes people have long lists of strings whose total length exceeds 255
characters. frak would be a nice tool for compressing them.

------
wetherbeei
The author found the optimal structure for constructing these expressions:
store them in a trie. Because the generated regular expressions match _only_
the inputs, this approach can yield a much more compact test when the inputs
overlap heavily.

It would be cool if the inputs weren't matched exactly, and frak could figure
out a general pattern for your inputs (decimals, capitalized words, etc). That
could help newcomers with a starting expression that matches their inputs.

~~~
ori_b
> _It would be cool if the inputs weren't matched exactly, and frak could
> figure out a general pattern for your inputs (decimals, capitalized words,
> etc)._

The problem is that there are a huge number of possible solutions for any
given input. For example, you could always give the trivial solution: ".*"

For something like that to work, you'd need a large dictionary of common
patterns, and then you'd want to compare against the dictionary to see if it
matches a sequence of common patterns.

I can't imagine that sort of thing being too useful.

~~~
eru
An alternative solution is to provide a list of matches and non-matches, and
look for the shortest or simplest regex that separates the two.

(Finding the actual minimal regex, instead of just a reasonable guess, might
be a computationally tough problem. I guess it's in coNP and might be in NP,
too. An algorithm in P would be nice to find.)

Update: Finding any separating regex would be in P. A separating finite
automaton is easy to find, and then you just convert that into a regex. Now,
how do you find the minimal regex?
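
A brute-force version of that search is easy to sketch, though exponential in pattern length, so it only works at toy sizes: enumerate candidate patterns over a small token set in order of length, and return the first one that separates the two sets. (Assumed names throughout; this is not abecedarius's code.)

```python
import itertools
import re

def shortest_separator(positives, negatives, tokens, max_len=6):
    """Return the first (hence shortest) pattern over `tokens` that
    fullmatches every positive and no negative. Exponential search."""
    for n in range(1, max_len + 1):
        for combo in itertools.product(tokens, repeat=n):
            pattern = "".join(combo)
            try:
                rx = re.compile(pattern)
            except re.error:
                continue  # skip syntactically invalid candidates
            if (all(rx.fullmatch(p) for p in positives)
                    and not any(rx.fullmatch(q) for q in negatives)):
                return pattern
    return None
```

E.g. separating ["aa", "aaa"] from ["ab", "b"] over the tokens a, b, ., *, | finds "a*" at length two.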

~~~
abecedarius
I wrote a brute-force minimal-regex finder once; wish I could remember where I
put it.

~~~
eru
For your entertainment,
[https://github.com/matthiasgoergens/Div7](https://github.com/matthiasgoergens/Div7)
has a regex finder for divisibility. E.g.
[https://raw.github.com/matthiasgoergens/Div7/master/regex7](https://raw.github.com/matthiasgoergens/Div7/master/regex7)
checks decimal numbers for divisibility by 7.
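
The machinery behind such a regex is a 7-state DFA whose states are remainders mod 7; state elimination on that DFA produces the (enormous) regex. The DFA itself is tiny, as this sketch shows:

```python
def divisible_by_7(digits):
    # States are remainders mod 7: reading digit d moves
    # state r to (10*r + d) % 7. Accept iff we end in state 0.
    r = 0
    for ch in digits:
        r = (10 * r + int(ch)) % 7
    return r == 0
```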

~~~
abecedarius
Very cute. :-)

I found my code and it wasn't exactly it (it just enumerated all regexes in
order of size), but here's a crude solution now:
[https://github.com/darius/sketchbook/blob/master/regex/find_...](https://github.com/darius/sketchbook/blob/master/regex/find_min_re.py)

(In defense of my memory, I had written superoptimizers for other things.)

------
malandrew
This may be a naïve question, but what about telling it what words the pattern
should not match?

~~~
noprompt
Isn't that more or less the same as what it already does? That is, by telling
frak which words you want to match, you're implying you don't want to match
anything else.

Of course, you could generate two patterns: one with the words you do want to
match and the other with the words you don't. Then simply handle the logic in
your program.
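
That two-pattern combination is a one-liner to wire up. A hypothetical Python sketch (the word lists are made up):

```python
import re

allow = re.compile(r"bat|bats|bit")  # words you do want
deny = re.compile(r"bats")           # words you don't

def accept(word):
    # Match against the allow list, then veto anything on the deny list.
    return bool(allow.fullmatch(word)) and not deny.fullmatch(word)
```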

------
generj
Huh. That's a neat trick.

I certainly wouldn't stop at the output, but this is a very useful tool to
speed up regex generation.

------
zeckalpha
This is clever. Using the trie is smart, but does it account for suffixes or
infixes?

~~~
zeckalpha
I misunderstood. Instead of representing the set of sequences, I assumed it
was attempting to extract a general pattern from the inputs, which would be
difficult and non-deterministic.

