

Simple Helper to Extract Values from a String - laktek
http://laktek.com/2012/10/04/extract-values-from-a-string/

======
hythloday
This looks like a reasonable implementation of a Type-3 star-free grammar. I
think it's a shame the author eschewed the idea of making it bijective with
e.g. Python's string.format or Java's MessageFormat.format - is there a
formatting library in Javascript that outputs strings from a "{var_name}" and
"`var_name`" template style?

The linkbait title ("Too lazy to write Regular Expressions?") definitely rubs
me the wrong way. The syntax for Perl-compatible regular expressions is
horrendous - it's difficult to identify the isomorphism with the underlying
ndfa, which in my view is the most intuitive representation, it's practically
uncomposable, and it's far more dense with non-alphanumeric characters than
any programming language other than something like J.

I guess what I was hoping for when I clicked this is what you might call a
"lexer combinator" - something like (Python syntax):

    
    
      >>> combinator = Literal('/') + Group("year", OneOrMore(Digit())) + 
      ...   Literal('/') + Group("month", OneOrMore(Digit())) +
      ...   Literal('/') + Group("day", OneOrMore(Digit())) +
      ...   Literal('/') + Group("title", OneOrMore(Letter())) + Literal('.html')
      ...
      >>> lex(combinator, '/2012/08/12/test.html')
      {'title': 'test', 'month': '08', 'day': '12', 'year': '2012'}
      >>> qs_combinator = combinator + Literal('?') +
      ...   Group("qs", ZeroOrMore(
      ...     Group("key", OneOrMore(Letter())) + Literal('=') + Group("value", OneOrMore(Letter()))))
      ...
      >>> lex(qs_combinator, '/2012/08/12/test.html?show=full')
      {'title': 'test', 'qs': [{'value': 'full', 'key': 'show'}], 'month': '08', 'day': '12', 'year': '2012'}
    

Obviously there are opportunities to make that less verbose, though I don't
know if you can ever make it as close to PCRE syntax as parser combinators get
to EBNF -- perhaps you could do something like 'a'*1 to represent 'a'+ in
PCRE. Does such a library exist?
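For what it's worth, the idea can be sketched in a few dozen lines of plain Python. All of the names here (Literal, Digit, Letter, OneOrMore, Group, lex) are invented for illustration; this is not an existing library:

```python
# A self-contained sketch of the "lexer combinator" idea above.
# Every class name here is hypothetical, invented for this example.

class Matcher:
    def __add__(self, other):          # `+` builds a sequence
        return Seq(self, other)

class Literal(Matcher):
    def __init__(self, text):
        self.text = text
    def match(self, s, i, caps):
        return i + len(self.text) if s.startswith(self.text, i) else None

class Digit(Matcher):
    def match(self, s, i, caps):
        return i + 1 if i < len(s) and s[i].isdigit() else None

class Letter(Matcher):
    def match(self, s, i, caps):
        return i + 1 if i < len(s) and s[i].isalpha() else None

class OneOrMore(Matcher):
    def __init__(self, inner):
        self.inner = inner
    def match(self, s, i, caps):       # greedy, no backtracking
        j = self.inner.match(s, i, caps)
        if j is None:
            return None
        while (k := self.inner.match(s, j, caps)) is not None:
            j = k
        return j

class Group(Matcher):
    def __init__(self, name, inner):
        self.name, self.inner = name, inner
    def match(self, s, i, caps):       # capture the matched slice by name
        j = self.inner.match(s, i, caps)
        if j is not None:
            caps[self.name] = s[i:j]
        return j

class Seq(Matcher):
    def __init__(self, first, second):
        self.first, self.second = first, second
    def match(self, s, i, caps):
        j = self.first.match(s, i, caps)
        return None if j is None else self.second.match(s, j, caps)

def lex(matcher, s):                   # succeed only on a full-string match
    caps = {}
    return caps if matcher.match(s, 0, caps) == len(s) else None

combinator = (Literal('/') + Group("year", OneOrMore(Digit())) +
              Literal('/') + Group("month", OneOrMore(Digit())) +
              Literal('/') + Group("day", OneOrMore(Digit())) +
              Literal('/') + Group("title", OneOrMore(Letter())) +
              Literal('.html'))

print(lex(combinator, '/2012/08/12/test.html'))
# {'year': '2012', 'month': '08', 'day': '12', 'title': 'test'}
```

ZeroOrMore and alternation would follow the same pattern; pyparsing explores a very similar combinator-style API in earnest.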

~~~
Erwin
This may be of some interest to you: <http://lfw.org/python/rxb.py> -- it
hasn't caught on since 1996 though.

I've seen some swear by OMeta, the parser used in the VPRI "simple OS"
project, powerful enough to implement a bare-bones TCP/IP stack as
essentially drawings:
<http://www.moserware.com/2008/04/towards-moores-law-software-part-3-of-3.html>

~~~
hythloday
That second link is beautiful. Thanks!

------
cefstat
Let's rewrite the examples using regular expressions (Python but that's not
extremely important):

    
    
      import re
    
      re.match(r"^/(?P<year>.+?)/(?P<month>.+?)/(?P<day>.+?)/(?P<title>.+?)\.html$",
               "/2012/08/12/test.html"
               ).groupdict()
    
      re.match(r"^(?P<name>.+?) <(?P<email>.+?)> \((?P<url>.+?)\)$",
               "John Doe <john@example.com> (http://example.com)"
               ).groupdict()
    
      re.match(r"^from (?P<from>.+?) to (?P<to>.+?)$",
               re.sub(r"\s+", " ", "from 4th October  to 10th  October")
               ).groupdict()
    
      re.match(r"^convert (?P<quantity>.+?) (?P<from_unit>.+?) to (?P<to_unit>.+?)$",
               "Convert 1500 Grams to Kilograms".lower()
               ).groupdict()
    

This didn't take long to write but even like this I would vastly prefer the
proposed syntax of extract-values which I find much easier to remember and to
write. So, yes, we can write regular expressions but why should we if for
certain simple cases (and the rule is that most cases _are_ simple) there is
an easier way to do the same thing?

Think, for example, of the built-in support for templates in Python
(string.Template). Templates offer nothing that cannot be done with the
standard formatter _but_ they are simpler to use. In a sense this proposal is
the _dual_ of templates, as it offers easy parsing for a restricted number of
cases. And it definitely doesn't deserve all the negativity.
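As a concrete illustration of that duality, string.Template formats values *into* the very pattern that extract-values would pull them back *out* of (sketch using the OP's URL example):

```python
from string import Template

# string.Template: the formatting "dual" of extraction -- values go in,
# whereas extract-values would pull them back out of the same pattern.
t = Template('/$year/$month/$day/$title.html')
path = t.substitute(year='2012', month='08', day='12', title='test')
print(path)  # /2012/08/12/test.html
```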

Having said that, I don't think I would use such a facility unless it were part
of the standard library. Writing a regular expression now and then is easier
(at least for me) than having to keep around an extra library that, even with
the help of pip and friends, I would have to re-install every time I change my
environment (upgrade Python, setup a new machine etc.).

~~~
mbreese
For Python libraries, I used to shy away from adding extra modules, just
because I didn't want to install things in the system Python site-packages.

Then I started using virtualenv for everything, and my life has been much
simpler. Now, I have a separate virtualenv setup for each project, and a
simple script/Makefile to download and build dependencies. I also add a short
shell script to wrap each project that automatically sets up the virtualenv
environment and runs the appropriate Python script.

Needless to say, I'm very happy that a virtualenv equivalent (venv) has been
added to the standard library.

------
injekt
A neat idea, but I'm against hiding the underlying regular expressions from a
user. They're such a powerful tool, anyone using them for matching or
extraction should read up on them. That said, for simple use this is cool and
probably something that would appeal to beginners.

A zero-config, zero-feature simplified version in Ruby:
<https://gist.github.com/3832504>

~~~
VMG
We hide and abstract things in programming all the time.

For example, the proper (RFC-compliant) regex for an email is very complicated
and often implemented the wrong way.

As somebody who has written a lot of regexes, I'd rather use this library,
which has the correct abstraction, than google the regex for http-urls for the
512th time.

~~~
keeperofdakeys
I hope no one is seriously suggesting that regular expressions should be used
for tasks like email address verification and HTTP parsing. As well as being
incredibly slow, they are hard to read, and inferior to application-specific
parsers (for example, sometimes non-RFC-compliant emails are actually valid,
and dealing with whitespace in HTML is a nightmare).

For the interactive case, the ability for them to be written quickly is what
makes them so helpful, and an abstraction library could take away this
advantage.

~~~
rjbond3rd
I'm not exactly disagreeing but just curious:

1. How can a non-RFC-compliant email be valid?

2. What about compiled regexes, performance-wise?

3. Sometimes a regex is faster than the overhead of a parser, so wouldn't the
choice be dependent on context? In other words, regexes are not always slower,
true?

4. Wouldn't some abstraction libraries utilize regexes under the hood? Would
that be wrong in your view?

P.S. Some languages allow the option for very readable regexes, e.g. separate
each component on its own line, with a comment.
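Python is one such language: the re.VERBOSE flag lets a pattern carry its own layout and comments, for example:

```python
import re

# re.VERBOSE ignores unescaped whitespace in the pattern and allows
# "#" comments, so each component can sit on its own documented line.
date_re = re.compile(r"""
    (?P<year>\d{4})   # four-digit year
    -
    (?P<month>\d{2})  # two-digit month
    -
    (?P<day>\d{2})    # two-digit day
""", re.VERBOSE)

print(date_re.match("2012-10-04").groupdict())
# {'year': '2012', 'month': '10', 'day': '04'}
```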

~~~
Domenic_S
_> How can a non-RFC-compliant email be valid?_

甲斐@黒川.日本 is a non-RFC 5322-compliant, but still valid, email address.

Unless you're implying that "valid" === RFC 5322-compliant, in which case the
example isn't valid ;)

The best way to validate an email address: _send an email to that email
address_ containing a confirmation link. Simple, easy.
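A sketch of that flow in Python; the secret handling, URL, and helper names are all hypothetical:

```python
import hashlib, hmac, os

# Hypothetical confirm-by-email flow: skip syntactic validation entirely
# and instead mail the address a token that only the server can mint.
SECRET = os.urandom(32)  # per-deployment secret, illustrative only

def make_token(email):
    # HMAC binds the token to the address without per-user server state.
    return hmac.new(SECRET, email.encode("utf-8"), hashlib.sha256).hexdigest()

def confirm(email, token):
    # Constant-time comparison avoids leaking token prefixes.
    return hmac.compare_digest(make_token(email), token)

link = "https://example.com/confirm?token=" + make_token("甲斐@黒川.日本")
# mail `link` to the address; call confirm() when the user clicks it
```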

~~~
rjbond3rd
Ah, understood. I was thinking "valid == well-formed" without knowing whether
it really works (i.e., could be deleted), whereas I see you rightfully point
out that it more reasonably means "it works." Thank you, makes sense.

------
nateweiss
Cool, I love having stuff like this out there. It may not always be the right
tool for a given problem, but sometimes it will be. I imagine I'll have cause
to use it sometime this year, so thanks!

In particular, I could see this being useful in a situation where you want the
validation "rule" to be readable by non-programmers (perhaps if building a
system where an end-user can choose amongst a set of canned validation rules
for custom fields that they can add to something-or-other). In such a case,
this simple grammar is something you could display in whatever "configurator"
UI the user has. In general I like stuff that both programmers and non-
programmers (or just a programmer that doesn't feel like being a programmer
right then) can understand.

Thanks for taking the time to post it laktek.

------
taude
Regexes: fun to figure out and write the first time. A pain to debug and
support x months in the future, especially by other developers, on anything
more than the common cases.

I think this is a pretty useful idea.

~~~
GlennS
There are some things that can help with this:

0. Consider whether string.split could do part of the job instead.
1. Consider writing unit tests for your regexes.
2. Split your regex onto multiple lines.
3. Turn on the option that lets you include comments inside your regex.
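The string.split suggestion in miniature, on the OP's URL example:

```python
# str.split handles the simple delimiter-based cases with no regex at all.
path = "/2012/08/12/test.html"
_, year, month, day, rest = path.split("/")
title = rest.rsplit(".", 1)[0]   # strip the ".html" extension

print({'year': year, 'month': month, 'day': day, 'title': title})
# {'year': '2012', 'month': '08', 'day': '12', 'title': 'test'}
```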

I do agree that really complicated regexes get to be a pain.

~~~
taude
I try to assuage my pain and that of my devs by actually... commenting the
regex. The comment explaining things is often longer than the regex itself,
but it helps quite a bit.

------
smegel
This should have been titled "Too lazy to learn regular expressions"

Once you have learned them, using them is the epitome of laziness.

~~~
Jare
> using them is the epitome of laziness

I think you meant 'writing them'. Reading, debugging and understanding them is
often soul-crushing.

------
macavity23
I'm not at all convinced by this, I'm afraid. Regexes are intimidating at
first, but when you get the hang of them they are a fabulously powerful tool
that absolutely every programmer should learn.

As for 'laziness', I don't buy that regexes take any longer to write than the
linked method. The third example can be done (in python, but I'm sure the ruby
equivalent is just as simple) with:

    
    
      re.sub(r'from\s+(\w+)\s+(\w+)\s+to\s+(\w+)\s+(\w+)', r"{'from': '\1 \2', 'to':'\3 \4'}", "from 4th October  to 10th  October")
    

Which is only ten characters longer, requires no external dependencies and can
be used with very little modification in (almost) every other language.

~~~
philh
That's harder to read, partly but not entirely because you need to count
parens and cross-reference, and it returns a string rather than a dict.
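For completeness, the same extraction with named groups (in Python, same idea as the parent) does return a real dict:

```python
import re

# Named groups return an actual dict instead of a dict-shaped string,
# and no paren-counting is needed to see which capture is which.
m = re.match(r"from\s+(?P<from>\w+\s+\w+)\s+to\s+(?P<to>\w+\s+\w+)",
             "from 4th October  to 10th  October")
print(m.groupdict())
# {'from': '4th October', 'to': '10th  October'}
```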

------
stephenheron
I was really struggling to get my head around regular expressions until I
found this site: <http://rubular.com/>.

I found the visual way it shows the matches to be really useful.

~~~
buttscicles
I've found this to be useful personally. <http://gskinner.com/RegExr/>

------
TazeTSchnitzel
Regexes have one huge flaw: the syntax. If they had whitespace in them (like
CoffeeScript), and required quoting the characters you were looking for (to
distinguish from syntax), then paired with a little syntax highlighting they
would be far easier to read and write.

~~~
pmahoney
Ruby (borrowed from Perl and common in many regex impls I believe) has the "x"
modifier to ignore whitespace and comments in literal regexes, for example:

    
    
        str = "2012-10-04T07:20:00"
    
        str =~ /
            (\d{4}) # four digit year
            -
            (\d{2}) # two digit month
            -
            (\d{2}) # two digit day
            T
            (\d{2}) # two digit hour
            :
            (\d{2}) # two digit min
            :
            (\d{2}) # two digit sec
           /x

~~~
gmac
And if your implementation doesn't have it, it's not hard to add.

E.g. for JS:
<http://blog.mackerron.com/2010/08/08/extended-multi-line-js-regexps/>

------
mhd
So, it's basically a more hidden version of Regexp::Common?

<https://www.metacpan.org/module/Regexp::Common>

~~~
prakashk
Template::Extract appears closer.

<https://metacpan.org/module/Template::Extract>

Here's something equivalent to the example in the OP:

    
    
        #!/usr/bin/env perl
         
        use Template::Extract;
        use Data::Printer;
         
        sub extractValues {
            my ($document, $template, $opts) = @_;
         
            my $delims = $opts->{delimiters} || [ qw/ { } / ];
            $document = lc $document
                if $opts->{lowercase};
            if (exists $opts->{whitespace}) {
                my $ws = ' ' x $opts->{whitespace};
                $document =~ s/\s+/$ws/g;
            }
            Template::Extract->new( { START_TAG => $delims->[0], END_TAG => $delims->[1] } )
                             ->extract( $template, $document );
        }
         
        p extractValues('/2012/08/12/test.html', '/{year}/{month}/{day}/{title}.html');
         
        p extractValues('John Doe <john@example.com> (http://example.com)', '{name} <{email}> ({url})');
         
        p extractValues('from 4th October  to 10th  October',
                          'from `from` to `to`',
                          { whitespace => 1, delimiters => ['`', '`'] });
         

Output:

    
    
        \ {
            day   12,
            month   08,
            title   "test",
            year   2012
        }
        \ {
            email   "john@example.com",
            name   "John Doe",
            url   "http://example.com"
        }
        \ {
            from   "4th October",
            to   "10th  October"
        }

------
afhof
Eh. I use regular expressions often enough that I am pretty fast at them for
common cases, and usually get them right on the first try. This seems like it
would only be useful if you don't really use regexes that often.

As for users, I don't think I would trust most of them (as the article
suggests) to write their own pattern matching.

------
lignuist
I was hoping for a visual editor or something similar when I read the title.

Regular expressions are not hard to understand (once you understand finite
state machines), but the syntax is hard to read in many cases.

~~~
pmccool
Quite right. I see two issues with regexes: funky syntax and some important
and non-obvious limitations. If you want to use them, these two things add up
to a considerable barrier to entry. There are many examples of trivial
pattern-matching languages (DOS wildcards, UNIX wildcards and SQL text
matching to name but three) that address both of these issues by being easy to
grasp and surprisingly useful.
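Python's fnmatch module is a neat demonstration of that trade-off: a trivial UNIX-style wildcard language that, under the hood, compiles down to the regex engine anyway:

```python
import fnmatch

# The trivial wildcard language: easy to grasp, surprisingly useful.
assert fnmatch.fnmatch("test.html", "*.html")
assert not fnmatch.fnmatch("test.txt", "*.html")

# ...and each wildcard pattern is just a regex in disguise.
print(fnmatch.translate("*.html"))  # the regex this wildcard compiles to
```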

I haven't actually tried this, but it seems easy to grasp and I've spent
enough time busting up delimited strings that "surprisingly useful" is
plausible too.

------
prezjordan
Love your stuff, laktek. This is a neat tool - I'll definitely be using it
for convenience's sake.

------
zxw
Looks really nice, very readable. However, I want a python version!

~~~
prakashk
Not identical, but appears to be similar: templatemaker
(<https://code.google.com/p/templatemaker/>)

------
DannoHung
Congratulations, you've implemented a kind of crappy pseudo-CFG parser.

See SimpleParse (1) for an example of a recursive descent parser implemented
in Python accepting EBNF grammars (2).

I was talking with my friend at work today about the fact that most languages
don't seem to have a simple library for parsing data so people end up relying
on Regexes to get all their data extraction from strings done. He knows much,
much more about parsers than I do though and told me to shut up and build one
myself. That gave me pause.

1) <http://simpleparse.sourceforge.net/>

2) <http://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_Form>

~~~
helper-method
Your comment is the reason why I detest the community on HN. If I had put my
personal time into this and then read such a comment, I would find it
extremely demotivating. The solution may not be perfect, but it would be much
more helpful if you offered constructive criticism instead of this.

~~~
oinksoft
I was noting this yesterday in the DollarDollarJS thread ... presumably in the
quest for "karma," flaming _open-source projects_ has really gotten popular
over the past eight months or so on this website (I think that flaming of
startup ideas has always been a feature, however).

