[dupe] Frak takes an entirely different approach to generating regular expressions (thechangelog.com)
27 points by joeyespo on Aug 15, 2013 | 25 comments



I get the motivation for writing it.

I am not sure I see the wisdom in using it.

The typical use case for regular expressions is to find patterns in an arbitrary unknown string.

There can be multiple regular expressions which match a particular string or set of strings, thanks in part to the Kleene star. For example, the strings:

  "foo" "bar" "baz" "quux"
are all matched by

  (?:ba[rz]|foo|quux)
but also by

  [[:alpha:]]* | [[:lower:]]*  etc.
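The overlap is easy to check with any regex engine; a quick sketch in Python's `re` (using `[a-z]*` in place of the POSIX class syntax):

```python
import re

words = ["foo", "bar", "baz", "quux"]
tight = re.compile(r"(?:ba[rz]|foo|quux)")
loose = re.compile(r"[a-z]*")  # standing in for [[:lower:]]*

# Both expressions accept every example string...
assert all(tight.fullmatch(w) for w in words)
assert all(loose.fullmatch(w) for w in words)

# ...but the looser one also accepts strings far outside the intended set.
assert tight.fullmatch("turducken") is None
assert loose.fullmatch("turducken") is not None
```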
Having started the port of VerbalExpressions to Racket, I've found the fundamental issue is that the only practical way to express an automaton is to use one of the formalisms that can express an automaton. TANSTAAFL.


This isn't something I would recommend using for everyone. But it does have use cases that make it appealing and useful. The same thing could be said about VerbalExpressions.

While it's true `[[:alpha:]]|[[:lower:]]` (even better `([[:alpha:]]|[[:lower:]])` or just `[[:alpha:]]`) would work to match "foo", "bar", "baz", etc., it would also match "turducken" or epsilon (""). This isn't always what you want.

One nice thing about generating a regular expression in this manner is that it's arguably much easier to maintain. Instead of having a heavy pattern to refactor or, god help you, append to, you simply add another string to your list and be on your way. Plus you get the potential memory savings and performance benefits.

In the case of VerbalExpressions, the `or` method could use the same technique as frak to improve the quality of the expressions it emits. While on the subject, VerbalExpressions could also borrow some good ideas from Christophe Grand's regex library: https://github.com/cgrand/regex.

Edit: Forgot some colons.


I don't disagree with any of your points. But I don't think they address the problem inherent in Frak - there's no way to ensure that the expression gives the intended result against an arbitrary input.

The results with "barber" are indeterminate since the input string "bar" matches (^bar$) as well as (bar). In other words, who knows what language is described by the output of Frak?

I am not in love with VerbalExpressions - porting it to Racket was a reasonably sized project from which I felt I could learn something about Racket, I did. And something about regular expressions, which I also did.

I think VerbalExpressions suffers from a bit of the same conceptual problem as Frak - though perhaps to a lesser degree. That problem is treating Regular Expressions as something other than the formal description of a language.

However, I think that the transition from a muddled concept of regular expressions to a correct one is more straightforward with VE - i.e. adding (kleeneStar), (atLeastOne), and (exactlyOne) to VE doesn't break the bigger idea.

The general problem with anything that attempts to implement regular expressions informally can be summed up as:

  Pick any two:

    ( ) Kleene star

    ( ) Union

    ( ) Concatenation

It's a "cheap, fast, correct" problem.


> "But I don't think they address the problem inherent in Frak - there's no way to insure that the expression gives the intended result against an arbitrary input."

I've made sure to thoroughly test the patterns produced by frak. They've also passed muster with another project where frak has proven to be a good idea (see this issue https://github.com/guns/vim-clojure-static/pull/28). While they are not fully optimized (yet), I'm fairly sure they are correct. If you have found a counterexample, please open an issue.

> "The results with "barber" are indeterminate since the input string "bar" matches (^bar$) as well as (bar)."

The next point release will allow users to specify an exact match. Initially, this wasn't an option because Clojure has two functions, `re-find` and `re-matches`, for relaxed and strict matching respectively.

  user> (re-find #"bar" "barber")
  "bar"
  user> (re-find #"^bar$" "barber")
  nil
> "I think VerbalExpressions suffers from a bit of the same conceptual problem as Frak - though perhaps to a lesser degree."

There is nothing conceptually wrong with what VerbalExpressions or frak attempt to do. They may not provide value to everyone, but they are useful contributions and have tangible benefits.

> "That problem is treating Regular Expressions as something other than the formal description of a language."

Why is that a problem?

It has been shown it's possible to use regular expressions as a means to perform arithmetic. Is that too a problem?

I don't think so.

Formality is a great and necessary thing, but it can also get in the way. While I'll agree understanding it is important, I'll contend getting hung up on it is dangerous.


I have very limited experience with regexes, but would it be a good idea for the software to perhaps try to show multiple different regexes that would match? I'd imagine that's more difficult, but if we're talking about starting with a narrow scope and then broadening out... there's an interesting UI to be built around that.


I have limited experience as well. It's the reason I decided to try a port of VerbalExpressions. As I've learned more, I recognize that most of my puzzlement in regards to regular expressions is based on a misunderstanding of what they actually are.

In ordinary language it is simple to say, "A regular expression describes a pattern to be matched."

Mathematically, however, a regular expression describes a language. The typical use case is to test strings for membership within the language.

What regular expressions do is provide a formal description of a language. Once we attempt to construct regular expressions based on informal descriptions of a language, we are constructing arbitrary languages, and the only way to test whether an arbitrary language is the formal language we wish to construct is to construct the formal language and compare it to the arbitrary one.

There are two ways we can construct the formal description - as a regular expression or as the set of all legal strings.

Frak replaces formal description with a formal enumeration of all legal strings. But what does this get us in terms of regular expressions? We don't need regular expressions for:

  (require '[clojure.string :as string])

  (defn matcher? [los s]
    (boolean (some #(string/includes? s %) los)))

  (matcher? ["foo" "bar" "baz" "quux"] "barber") ;=> true
In other words, how do we get

  (?:ba[r-z]|foo|quux)
by giving examples without including ["bas" "bat" "bau" "bav" "baw" "bax" "bay"] in the input list to Frak?

Then consider that some contexts are case sensitive. Which of ["bar" "Bar" "bAr"..."BaR"..."BAR"] are and aren't legal?
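The range question can be checked directly; a quick Python sketch (using `re` as a neutral engine):

```python
import re

ranged = re.compile(r"(?:ba[r-z]|foo|quux)")

# [r-z] covers nine suffix letters, so the expression accepts "bas",
# "bat", ..., "bay" in addition to the original "bar" and "baz".
accepted = [s for s in ("ba" + c for c in "abcdefghijklmnopqrstuvwxyz")
            if ranged.fullmatch(s)]
# accepted == ['bar', 'bas', 'bat', 'bau', 'bav', 'baw', 'bax', 'bay', 'baz']
```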


I would not use this as a replacement to test the membership of a string in a collection of strings. That's a terrible use case. I've done benchmarks comparing membership checking and regular expression testing and the former is almost always significantly faster and requires less overhead.
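A rough sketch of the kind of comparison meant here (a hypothetical Python micro-benchmark, not the actual benchmarks mentioned; absolute numbers will vary by runtime):

```python
import re
import timeit

words = ["foo", "bar", "baz", "quux"]
word_set = set(words)
pattern = re.compile(r"(?:ba[rz]|foo|quux)")

# Exact membership via a hashed set needs no state machine at all.
t_set = timeit.timeit(lambda: "quux" in word_set, number=100_000)
t_re = timeit.timeit(lambda: pattern.fullmatch("quux"), number=100_000)

# Both approaches agree on the answer; the set lookup is typically
# the faster of the two for plain membership testing.
assert ("quux" in word_set) == bool(pattern.fullmatch("quux"))
```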

With regard to ranges, that's another one of the areas I have yet to look into. The question you have to answer is: will [r-z] generate fewer states?

As far as case insensitivity, well, that's usually just an optional flag.


>"I would not use this as a replacement to test the membership of a string in a collection of strings."

But that's what a regular expression is used for - testing an arbitrary string for membership within the set of valid strings of the language formally described by the regular expression.

The power of a regular expression is that it can enumerate all the valid strings for me. If I have to explicitly list them, what have I gained? To put it another way, the equivalent output to the example is:

  (?:foo|bar|baz|quux)
it is one character longer than what was produced

  (?:ba[rz]|foo|quux)
but can reasonably be argued to be clearer.

What I was getting at with my pseudo-code example is that if the goal is to interpret the input down to the fewest possible states from examples, then the regular expression is redundant - all we need are the examples and `if`.

There's shorter syntax:

  > (frak/pattern ["Clojure" "Clojars" "ClojureScript"])

  #"(?:Clojure|Clojars|ClojureScript)"
Why not use the simplest possible syntactic sugar?

As I said in my first comment, I understand the reasons for creating Frak. I find thinking about it stimulating and illuminating. It provides a great jumping-off point for unpacking regular expressions and the terseness of their language.


> "But that's what a regular expression is used for - testing an arbitrary string for membership within the set of valid strings of the language formally described by the regular expression."

Formally yes. And if it were always more performant to use a regular expression for this task, I would encourage the use of this tool to do so. However, depending on the data structure containing the strings, it may be more performant to simply search that.

> "it is one character longer than what was produced"

It's not the length of the pattern that matters. Rather, it's the performance characteristics of the underlying state machine once the expression is compiled.

> "Why not use the simplest possible syntactic sugar?"

  (?:Clojure|Clojars|ClojureScript)
While this is certainly easier to read and understand, it will have performance drawbacks when using an NFA engine where backtracking is a real thing. It will also have a larger number of states when compared with the alternative.

  Cloj(?:ure(?:Script)?|ars)
Suppose I am interested in testing if "ClojureScript" is a member of the set of strings described by the first expression. To be in a final state I will have to enter no fewer than 25 states and backtrack twice. With the second expression I will only need to enter 13 states before being in a final state and will not backtrack at all.
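The two patterns describe the same language, so the difference is purely in the compiled state machine; a quick check with Python's backtracking `re` engine:

```python
import re

flat = re.compile(r"(?:Clojure|Clojars|ClojureScript)")
factored = re.compile(r"Cloj(?:ure(?:Script)?|ars)")

for w in ["Clojure", "Clojars", "ClojureScript"]:
    # Both accept exactly the same three strings...
    assert flat.fullmatch(w) and factored.fullmatch(w)

# ...and both reject near misses; the factored form simply reaches an
# answer with fewer states and without backtracking over the shared prefix.
assert flat.fullmatch("Cloj") is None
assert factored.fullmatch("Cloj") is None
```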

For small patterns the choice to use something like frak is arguably splitting hairs; you won't gain much other than you didn't have to write an expression. But for enormous patterns, like the one I share in the README, there are real benefits from the sort of optimization frak provides.


I just wanted to point out that nowadays "qux" generally comes before "quux" in the standard series.

http://www.catb.org/~esr/jargon/html/Q/qux.html


Sometimes, one of your regular expressions is "keyword" or the like. Auto-generating that one regex spares the usual concatenation of the literal options into one relatively inefficient monster.


This is an excellent use case.


I really like the idea... but so far, it seems limited to actual characters, with no "*" or "+" expansion, no collapsing whitespace into "\s", etc.

So it doesn't seem like you could feed it "a", "aaa", "aaaaa" and then get "a+".

But even if it did, it would seem there would be so many edge cases in what the "correct" regular expression should be, that specifying them all would be just as hard as writing the regular expression in the first place.

But nevertheless, definitely a cool little toy. Would be great to have a web front end to test it out, as webjames said.


You're asking for implicit induction, which is not what the user would want in the general case. "a" "aaa" "aaaaa" could be expanded to "a(aa){0,2}" through deduction, and no more.

Since frak is using a trie, this kind of encoding should be a conceptually straightforward extension of the existing parse structure.
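A minimal sketch of the trie-to-pattern idea (illustrative Python, not frak's actual Clojure implementation; optimizations such as collapsing `(?:r|z)` into `[rz]` are omitted):

```python
import re

def trie_pattern(words):
    """Build a factored alternation by sharing common prefixes in a trie."""
    END = None  # sentinel key marking "a word may end at this node"
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True

    def emit(node):
        end_here = END in node
        branches = [re.escape(ch) + emit(child)
                    for ch, child in node.items() if ch is not END]
        if not branches:
            return ""
        body = branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"
        # If a word can stop at this node, the continuation is optional.
        return "(?:" + body + ")?" if end_here else body

    return emit(root)

# trie_pattern(["Clojure", "Clojars", "ClojureScript"])
#   => "Cloj(?:ure(?:Script)?|ars)"
```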


These expansions are certainly possible and I'm planning to investigate whether they have performance benefits. One interesting discovery we made last week is that `(a|b|c)` actually produces a larger state table than `[abc]`. This was the case for both Java and Vim. While the performance gains are minute, they are nevertheless gains.

I haven't checked if "a{4}", for example, is faster than "aaaa" or produces fewer states. But it's something I am interested in. I just haven't had enough time this week.


For those who are missing wildcards:

Wildcards are not possible when deriving a finite automaton from a finite set of example strings. However, if you aren't using, say, ø in your inputs, just make your input to frak "fooøbar" and then add your wildcard to the generated regex.
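That placeholder trick can be sketched like this (Python; the `generated` string is a stand-in for frak's output, not an actual frak call):

```python
import re

# Pretend "fooøbar" came back from frak, with "ø" as a placeholder
# character known not to occur in any real input.
generated = "fooøbar"

# Splice the wildcard in afterwards.
pattern = re.compile(generated.replace("ø", ".*"))

assert pattern.fullmatch("foo-anything-bar")
assert pattern.fullmatch("foobar")  # ".*" also matches the empty string
assert pattern.fullmatch("fooXbaz") is None
```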

I agree the output could be more compact, and collapse certain conventional characters (\n, \t and the like). The tool itself has all the power one should need.


Great work - I'd love it if someone would write a web front end to this.


There will be a JavaScript version available later this week, both for Node.js and the browser. If you want to try out the command line version, see issue #2 for instructions.


It's on GitHub - there's nothing stopping anyone who wants it from implementing it.


This is pretty cool - nice work! Can't say I've seen a trie-based construction like this, but it's clever and looks like a near-perfect fit.


Apparently the Perl community has been doing this for years. :)


GitHub is currently suffering a big DDoS, here's the Google Cache version:

http://webcache.googleusercontent.com/search?q=cache:PXnt0AW...


Is there an analogous implementation in JavaScript?


Someone told me it was present in Emacs; here's some old source code for it: http://stuff.mit.edu/afs/sipb/contrib/emacs/packages/w3m_el-...


On its way this week. See the reply above.



