Hacker News new | past | comments | ask | show | jobs | submit login
Simple Helper to Extract Values from a String (laktek.com)
83 points by laktek on Oct 4, 2012 | hide | past | favorite | 48 comments

This looks like a reasonable implementation of a Type-3 star-free grammar. I think it's a shame the author eschewed the idea of making it bijective with e.g. Python's string.format or Java's MessageFormat.format - is there a formatting library in Javascript that outputs strings from a "{var_name}" and "`var_name`" template style?

The linkbait title ("Too lazy to write Regular Expressions?") definitely rubs me the wrong way. The syntax for Perl-compatible regular expressions is horrendous - it's difficult to identify the isomorphism with the underlying ndfa, which in my view is the most intuitive representation, it's practically uncomposable, and it's far more dense with non-alphanumeric characters than any programming language other than something like J.

I guess what I was hoping for when I clicked this is what you might call a "lexer combinator" - something like (Python syntax):

  >>> combinator = Literal('/') + Group("year", OneOrMore(Digit())) + 
  ...   Literal('/') + Group("month", OneOrMore(Digit())) +
  ...   Literal('/') + Group("day", OneOrMore(Digit())) +
  ...   Group("title", OneOrMore(Letter())) + Literal('.html')
  >>> lex(combinator, '/2012/08/12/test.html')
  {'title': 'test', 'month': '08', 'day': '12', 'year': '2012'}
  >>> qs_combinator = combinator + Literal('?') +
  ...   Group("qs", ZeroOrMore(
  ...     Group("key", OneOrMore(Letter())) + Literal('=') + Group("value", OneOrMore(Letter()))))
  >>> lex(qs_combinator, '/2012/08/12/test.html?show=full')
  {'title': 'test', 'qs': [{'value': 'full', 'key': 'show'}], 'month': '08', 'day': '12', 'year': '2012'}
Obviously there are opportunities to make that less verbose, though I don't know if you can ever make it as close to PCRE syntax as parser combinators get to EBNF--perhaps maybe you could do something like 'a'*1 to represent 'a'+ in PCRE. Does such a library exist?

Scala's Parser-Combinators are about as good as you are going to get http://www.scala-blogs.org/2009/04/combinators-for-pretty-pr.... Consider the grammar for simple math expressions:

    expr     ::= sum
    sum      ::= product { ("+" | "-") product }
    product  ::= power   { ("*" | "/") power }
    power    ::= factor  [ "^" factor ]
    factor   ::= "(" expr ")" | variable | constant
    variable ::= identifier
    constant ::= floatLiteral
It's pure-Scala is

    def expr     = sum
    def sum      = product ~ rep(("+" | "-") ~ product)
    def product  = power   ~ rep(("*" | "/") ~ power)
    def power    = factor  ~ opt("^" ~ factor)
    def factor   = "(" ~ expr ~ ")" | variable | constant
    def variable = ident    
    def constant = floatLit 
And since it's pure-Scala, you get all the benefit of static typing (error detection, auto-complete, etc) when writing your parser. There's also no extra compile step: your parser compiles with the rest of your program and is just a java Object with a parseAll() method.

Lastly, it's a superset of PEG, which means that its behavior and limitations are all pretty well-studied academically and well-defined (which is more than can be said of most hacked-together regex solutions!).

Pyparsing is pretty cool as well, for python, but it's somewhat more verbose since python isn't as DSL-friendly as scala: http://pyparsing.wikispaces.com/HowToUsePyparsing

EDIT: a (different) example from Pyparsing:

    integer = Word( nums ) 
    variable = Word( alphas, max=1 )
    arithOp = Word( "+-*/", max=1 )
    equation = variable + "=" + integer + arithOp + integer
As far as I can tell, Pyparsing is also a superset of PEG.

This may be of some interest to you: http://lfw.org/python/rxb.py -- it hasn't caught on since 1996 though.

I've seen some swear to OMeta which is the parser used in the VPRI "simple OS" product, powerful enough to implement a bare-bones TCP/IP implementation as essentially drawings; http://www.moserware.com/2008/04/towards-moores-law-software...

That second link is beautiful. Thanks!

I haven't used it, but I believe Pyparsing does much or what you are describing.

The declarative naming is more like 'OneOrMore(Digit)("year")'.

Let's rewrite the examples using regular expressions (Python but that's not extremely important):

  import re


  re.match(r"^(?P<name>.+?) <(?P<email>.+?)> \((?P<url>.+?)\)$",
           "John Doe <john@example.com> (http://example.com)"

  re.match(r"^from (?P<from>.+?) to (?P<to>.+?)$",
           re.sub(r"\s+", " ", "from 4th October  to 10th  October")

  re.match(r"^convert (?P<quantity>.+?) (?P<from_unit>.+?) to (?P<to_unit>.+?)$",
           "Convert 1500 Grams to Kilograms".lower()
This didn't take long to write but even like this I would vastly prefer the proposed syntax of extract-values which I find much easier to remember and to write. So, yes, we can write regular expressions but why should we if for certain simple cases (and the rule is that most cases are simple) there is an easier way to do the same thing?

Think, for example, the built-in support for templates in Python (string.Template). Templates offer nothing that cannot not be done with the standard formatter but they are simpler to use. In a sense this proposal is the dual of templates, as it offers easy parsing for a restricted number of cases. And it definitely doesn't deserve all the negativity.

Having said that, I don't think I would use such facility unless it were part of the standard library. Writing a regular expression now and then is easier (at least for me) than having to keep around an extra library that, even with the help of pip and friends, I would have to re-install every time I change my environment (upgrade Python, setup a new machine etc.).

For Python libraries, I used to shy away from adding extra modules, just because I didn't want to install things in the system Python site-packages.

Then I started using virtualenv for everything, and my life has been much simpler. Now, I have a separate virtualenv setup for each project, and a simple script/Makefile to download and build dependencies. I also add a short shell script to wrap each project that automatically sets up the virtualenv environment and runs the appropriate Python script.

Needless to say, I'm very happy that virtualenv itself has been added to the standard library.

You can use tiny modules like this with near reckless abandon in node. Due to the way node's require() works and the way the package manager installs modules locally by default (rather than system-wide), most of the FUD of using 3rd party modules in other environments simply doesn't apply in node.

A neat idea, but I'm against hiding the underlying regular expressions from a user. They're such a powerful tool, anyone using them for matching or extraction should read up on them. That said, for simple use this is cool and probably something that would appeal to beginners.

A 0 config 0 feature simplified version in Ruby https://gist.github.com/3832504

We hide and abstract things in programming all the time.

For example, the proper (RFC-compliant) regex for an email is very complicated and often implemented the wrong way.

As somebody who has written a lot of regexes, I'd rather uses this library which has the correct abstraction than google the regex for http-urls for the 512th time.

This library doesn't help you here though. It doesn't bring you any abstractions on top of regexes (e.g. parsing URLs/emails). This is merely a different syntax for matching parts of a string.

I hope no one is seriously suggesting that regular expressions should be used in programs, such as email address verification and http parsing. As well as being incredibly slow, they are hard to read, and inferior to application-specific parsers (for example, sometimes non-RFC-compliant emails are actually valid, and dealing with whitespace in html is a nightmare).

For the interactive case, the ability for them to be written quickly is what makes them so helpful, and an abstraction library could take away this advantage.

I'm not exactly disagreeing but just curious:

1. How can a non-RFC-compliant email be valid?

2. What about compiled regexes, performance-wise?

3. Sometimes a regex is faster than the overhead of a parser, so wouldn't the choice be dependent on context? In other words, regexes are not always slower, true?

4. Wouldn't some abstraction libraries utilize regexes under the hood? Would that be wrong in your view?

P.S. Some languages allow the option for very readable regexes, e.g. separate each component on its own line, with a comment.

> How can a non-RFC-compliant email be valid?

甲斐@黒川.日本 is a non-RFC 5322-compliant, but still valid, email address.

Unless you're implying that "valid" === RFC 5322-compliant, in which case the example isn't valid ;)

The best way to validate an email address: send an email to that email address containing a confirmation link. Simple, easy.

Ah, understood. I was thinking "valid == well-formed" without knowing whether it really works (i.e., could be deleted), whereas I see you rightfully point out that it more reasonably means "it works." Thank you, makes sense.

Simply, some sites don't enforce the full set of RFC rules, as such people actually have non-RFC-compliant email addresses that are valid.

How can you 'compile' a regular expression?

For very simple regular expressions, they might be decently fast, but as soon as you start pulling out the more complicated regular expressions needed for parsing, you get slower. Even simple repeats can have a lot of overhead if not used correctly, have a look at "Looking Inside The Regex Engine" at this link http://www.regular-expressions.info/repeat.html. An equivalent parser doesn't need to do any form of backtracking, and doesn't care about the structure. For example, I've seen an application use regular expressions for html parsing. After spending a while figuring out what they actually did, I found the source html had changed its whitespace, but not the DOM structure, which broke the regular expressions.

As for my reasoning above, I think a lot of 'abstraction' libraries would be faster by operating directly on the data, instead of just converting it to regular expressions. The beauty of regular expressions is the speed at which they can be written.

I hope no one is seriously suggesting that regular expressions should be used in programs, such as email address verification and http parsing.

I think the comma changes the meaning of your sentence from what you intended. It was refreshing, though!

To be clear, with the comma, this reads essentially as "Don't use regexes in programs. Some examples of programs are ...".

If you replace the `\S+` with a non-greedy wildcard (`.*?`), you gain a lot more flexibility:


This examples looks like it covers a large number of cases:

      extractValues('John Doe <john@example.com> (http://example.com)', '{name} <{email}> ({url})')
    >> {'name': 'John Doe', 'email': 'john@example.com', 'url': 'http://example.com' }
What kinds of problems would you run into vs using regex? My thoughts were that you lose some control of your matches.

You lose the ability to write them quickly, which considering I mainly use them for code editing or one-off uses, is quite important. The thing about regular expressions is they aren't suited to general programming in their normal form (since they are hard to read, and slow), and the quoted example is better described as a general parser, and could be implemented much better directly then abstracting through regular expressions.

Cool, I love having stuff like this out there. It may not always be the right tool for a given problem, but sometimes it will be. I imagine I'll have cause to use it sometime this year, so thanks!

In particular, I could see this being useful in a situation where you want the validation "rule" to be readable by non-programmers (perhaps if building a system where an end-user can choose amongst a set of canned validation rules for custom fields that they can add to something-or-other). In such a case, this simple grammar is something you could display in whatever "configurator" UI the user has. In general I like stuff that both programmers and non-programmers (or just a programmer that doesn't feel like being a programmer right then) can understand.

Thanks for taking the time to post it laktek.

RegExs...fun to figure out and to write the first time. Pain to debug and supports x-number of months in the future, especially by other developers on anything more than the common cases.

I think this is a pretty useful idea.

There are some things that can help with this: 0. Consider whether string.split could do part of the job instead. 1. Consider writing unit tests for your regexes. 2. Split your regex onto multiple lines. 3. Turn on the option that lets you include comments inside your regex.

I do agree that really complicated regexes get to be a pain.

I try to assuage my pain and that of my devs by actually....commenting the REGex. The comment explaining things is often longer than the regEx, but helps quite a bit.

This should have been titled "Too lazy to learn regular expressions"

Once you have learned them, using them is the epitome of laziness.

> using them is the epitome of laziness

I think you meant 'writing them'. Reading, debugging and understanding them is often soul-crushing.

I'm not at all convinced by this, I'm afraid. Regexes are intimidating at first, but when you get the hang of them they are a fabulously powerful tool that absolutely every programmer should learn.

As for 'laziness', I don't buy that regexes take any longer to write than the linked method. The third example can be done (in python, but I'm sure the ruby equivalent is just as simple) with:

  re.sub(r'from\s+(\w+)\s+(\w+)\s+to\s+(\w+)\s+(\w+)', r"{'from': '\1 \2', 'to':'\3 \4'}", "from 4th October  to 10th  October")
Which is only ten characters longer, requires no external dependencies and can be used with very little modification in (almost) every other language.

That's harder to read, partly but not entirely because you need to count parens and cross reference, and returns a string rather than a dict.

I was really struggling to get my head around regular expressions until I found this site http://rubular.com/.

I found the visual way to shows the matches to be really useful.

I've found this to be useful personally. http://gskinner.com/RegExr/

Regexes have one huge flaw: the syntax. If they had whitespace in them (like CoffeeScript), and required quoting the characters you were looking for (to distinguish from syntax), then paired with a little syntax highlighting they would be far easier to read and write.

Ruby (borrowed from Perl and common in many regex impls I believe) has the "x" modifier to ignore whitespace and comments in literal regexes, for example:

    str = "2012-10-04T07:20:00"

    str =~ /
        (\d{4}) # four digit year
        (\d{2}) # two digit month
        (\d{2}) # two digit day
        (\d{2}) # two digit hour
        (\d{2}) # two digit min
        (\d{2}) # two digit sec

And if your implementation doesn't have it, it's not hard to add.

E.g. for JS: http://blog.mackerron.com/2010/08/08/extended-multi-line-js-...

And that flaw is compounded by subtly differing syntaxes among implementations, e.g., vim, python, grep, egrep.

So, it's basically a more hidden version of Regexp::Common?


Template::Extract appears closer.


Here's something equivalent to the example in the OP:

    #!/usr/bin/env perl
    use Template::Extract;
    use Data::Printer;
    sub extractValues {
        my ($document, $template, $opts) = @_;
        my $delims = $opts->{delimiters} || [ qw/ { } / ];
        $document = lc $document
            if $opts->{lowercase};
        if (exists $opts->{whitespace}) {
            my $ws = ' ' x $opts->{whitespace};
            $document =~ s/\s+/$ws/g;
        Template::Extract->new( { START_TAG => $delims->[0], END_TAG => $delims->[1] } )
                         ->extract( $template, $document );
    p extractValues('/2012/08/12/test.html', '/{year}/{month}/{day}/{title}.html');
    p extractValues('John Doe <john@example.com> (http://example.com)', '{name} <{email}> ({url})');
    p extractValues('from 4th October  to 10th  October',
                      'from `from` to `to`',
                      { whitespace => 1, delimiters => ['`', '`'] });

    \ {
        day   12,
        month   08,
        title   "test",
        year   2012
    \ {
        email   "john@example.com",
        name   "John Doe",
        url   "http://example.com"
    \ {
        from   "4th October",
        to   "10th  October"

eh. I use regular expressions often enough that I am pretty fast at them for common cases, and usually get them right on the first try. This seems like it would only be useful if you don't really use regex's that often.

For the case of users, I don't think I would trust most users (as the article suggests) to writing their own pattern matching.

I was hoping for a visual editor or something similar when I read the title.

Regular expressions are not hard to understand (once you understood finite state machines), but the syntax is hard to read in many cases.

Quite right. I see two issues with regexes: funky syntax and some important and non-obvious limitations. If you want to use them, these two things add up to a considerable barrier to entry. There are many examples of trivial pattern-matching languages (DOS wildcards, UNIX wildcards and SQL text matching to name but three) that address both of these issues by being easy to grasp and surprisingly useful.

I haven't actually tried this, but it seems easy to grasp and I've spent enough time busting up delimited strings that "surprisingly useful" is plausible too.

Love your stuff, laktek. This is a neat tool - I'll definitely being using it for convenience sake.

Looks really nice, very readable. However, I want a python version!

Not identical, but appears to be similar: templatemaker (https://code.google.com/p/templatemaker/)

Congratulations, you've implemented a kind of crappy psuedo-CFG parser.

See SimpleParse (1) for an example of a recursive descent parser implemented in Python accepting EBNF grammars (2).

I was talking with my friend at work today about the fact that most languages don't seem to have a simple library for parsing data so people end up relying on Regexes to get all their data extraction from strings done. He knows much, much more about parsers than I do though and told me to shut up and build one myself. That gave me pause.

1) http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Fo... 2) http://simpleparse.sourceforge.net/

Was there really any need to wrap your (useful) criticism in vile dickishness?

Your comment is the reason why I detest the community on HN. If I would have put my personal time into this, and would read such a comment, I would find it extremely demotivating. The solution may not be perfect but it would be much more helpful if you constructive criticism instead of this.

I was noting this yesterday in the DollarDollarJS thread ... presumably in the quest for "karma," flaming open-source projects has really gotten popular over the past eight months or so on this website (I think that flaming of startup ideas has always been a feature, however).

> most languages don't seem to have a simple library for parsing data

This is what lex and yacc (etc) are for, but they aren't quite simple. Arguably, XML has largely replaced these for "parsing data", since DTDs and XML Schema are CFGs with some minor restrictions (e.g. 1-unambiguous). In many cases, you don't even need a CFG, just nested lists is enough, so JSON fills that role. For even simpler data, just lists of values, there's csv and unix-style line-based formats (use awk etc for parsing).

Interest in CFG's seems to have decreased dramatically since XML (and later JSON) met the simpler needs it served.

Or a simple RFC6570 parser (URI Template [1]) that supports variable extraction. Variable extraction isn't part of the spec [2], but at least one library implements it [3].

[1] http://tools.ietf.org/html/rfc6570

[2] http://tools.ietf.org/html/rfc6570#section-1.4 - "In general, regular expression languages are better suited for variable matching."

[3] https://github.com/sporkmonger/addressable

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact