This looks like a reasonable implementation of a Type-3 star-free grammar. I think it's a shame the author eschewed making it bijective with, e.g., Python's str.format or Java's MessageFormat.format. Is there a formatting library in JavaScript that outputs strings from this "{var_name}" / "`var_name`" template style?
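To make "bijective" concrete: in Python, the third-party parse package implements exactly the inverse of str.format (assuming that's the kind of round trip meant here):

    from parse import parse  # third-party "parse" package, the inverse of str.format

    template = "/{year}/{month}/{day}/{title}.html"

    # Forward direction: values in, string out.
    path = template.format(year="2012", month="08", day="12", title="test")

    # Inverse direction: string in, values out.
    result = parse(template, path)
    print(result.named)
    # {'year': '2012', 'month': '08', 'day': '12', 'title': 'test'}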
The linkbait title ("Too lazy to write Regular Expressions?") definitely rubs me the wrong way. The syntax for Perl-compatible regular expressions is horrendous: it's difficult to see the isomorphism with the underlying NFA (which in my view is the most intuitive representation), it's practically uncomposable, and it's denser with non-alphanumeric characters than any programming language short of something like J.
I guess what I was hoping for when I clicked this is what you might call a "lexer combinator" - something like this (Python; the names below are made up for illustration, not an existing library):
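    class Lexer:
        def __init__(self, fn):
            self.fn = fn

        def __call__(self, s):
            return self.fn(s)

        def __add__(self, other):
            # Sequencing: run self, then other on the remaining input,
            # concatenating whatever each side captured.
            def seq(s):
                first = self(s)
                if first is None:
                    return None
                captured, rest = first
                second = other(rest)
                if second is None:
                    return None
                more, rest = second
                return captured + more, rest
            return Lexer(seq)

    def lit(text):
        # Match a literal string; capture nothing.
        return Lexer(lambda s: ([], s[len(text):]) if s.startswith(text) else None)

    def many1(chars):
        # One or more characters drawn from `chars` -- like [chars]+ in PCRE.
        def run(s):
            i = 0
            while i < len(s) and s[i] in chars:
                i += 1
            return ([s[:i]], s[i:]) if i else None
        return Lexer(run)

    digits = "0123456789"
    date = many1(digits) + lit("-") + many1(digits) + lit("-") + many1(digits)
    print(date("2012-08-12"))  # (['2012', '08', '12'], '')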
Obviously there are opportunities to make that less verbose, though I don't know if you can ever get as close to PCRE syntax as parser combinators get to EBNF - perhaps you could do something like 'a'*1 to represent 'a'+ in PCRE. Does such a library exist?
And since it's pure Scala, you get all the benefits of static typing (error detection, auto-complete, etc.) when writing your parser. There's also no extra compile step: your parser compiles with the rest of your program and is just a Java object with a parseAll() method.
Lastly, it's a superset of PEG, which means that its behavior and limitations are all pretty well-studied academically and well-defined (which is more than can be said of most hacked-together regex solutions!).
Let's rewrite the examples using regular expressions (in Python, but that's not especially important):
import re

re.match(r"^/(?P<year>.+?)/(?P<month>.+?)/(?P<day>.+?)/(?P<title>.+?)\.html$",
         "/2012/08/12/test.html"
         ).groupdict()

re.match(r"^(?P<name>.+?) <(?P<email>.+?)> \((?P<url>.+?)\)$",
         "John Doe <john@example.com> (http://example.com)"
         ).groupdict()

re.match(r"^from (?P<from>.+?) to (?P<to>.+?)$",
         re.sub(r"\s+", " ", "from 4th October to 10th October")
         ).groupdict()

re.match(r"^convert (?P<quantity>.+?) (?P<from_unit>.+?) to (?P<to_unit>.+?)$",
         "Convert 1500 Grams to Kilograms".lower()
         ).groupdict()
This didn't take long to write, but even so I would vastly prefer the proposed extract-values syntax, which I find much easier to remember and to write. So yes, we can write regular expressions - but why should we, if for certain simple cases (and the rule is that most cases are simple) there is an easier way to do the same thing?
Think, for example, of the built-in support for templates in Python (string.Template). Templates offer nothing that cannot be done with the standard formatter, but they are simpler to use. In a sense this proposal is the dual of templates: it offers easy parsing for a restricted set of cases. And it definitely doesn't deserve all the negativity.
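A minimal sketch of that duality, using only the stdlib:

    from string import Template

    # A template runs one way: values in, string out.
    t = Template("/$year/$month/$day/$title.html")
    print(t.substitute(year="2012", month="08", day="12", title="test"))
    # -> /2012/08/12/test.html

    # The proposal runs the other way: string in, values out, from the same
    # shape of pattern, e.g. "/{year}/{month}/{day}/{title}.html".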
Having said that, I don't think I would use such a facility unless it were part of the standard library. Writing a regular expression now and then is easier (at least for me) than keeping around an extra library that, even with the help of pip and friends, I would have to re-install every time I change my environment (upgrade Python, set up a new machine, etc.).
For Python libraries, I used to shy away from adding extra modules, just because I didn't want to install things in the system Python site-packages.
Then I started using virtualenv for everything, and my life has been much simpler. Now, I have a separate virtualenv setup for each project, and a simple script/Makefile to download and build dependencies. I also add a short shell script to wrap each project that automatically sets up the virtualenv environment and runs the appropriate Python script.
Needless to say, I'm very happy that a virtualenv equivalent (venv) has been added to the standard library.
You can use tiny modules like this with near reckless abandon in node. Due to the way node's require() works, and the way the package manager installs modules locally by default (rather than system-wide), most of the FUD around using 3rd-party modules in other environments simply doesn't apply in node.
A neat idea, but I'm against hiding the underlying regular expressions from a user. They're such a powerful tool that anyone using them for matching or extraction should read up on them. That said, for simple use this is cool and probably something that would appeal to beginners.
We hide and abstract things in programming all the time.
For example, the proper (RFC-compliant) regex for an email is very complicated and often implemented the wrong way.
As somebody who has written a lot of regexes, I'd rather use this library, which has the correct abstraction, than google the regex for HTTP URLs for the 512th time.
This library doesn't help you here though. It doesn't bring you any abstractions on top of regexes (e.g. parsing URLs/emails). This is merely a different syntax for matching parts of a string.
I hope no one is seriously suggesting that regular expressions should be used in programs for tasks such as email address verification and HTTP parsing. As well as being incredibly slow, they are hard to read, and they are inferior to application-specific parsers (for example, sometimes non-RFC-compliant emails are actually valid, and dealing with whitespace in HTML is a nightmare).
For the interactive case, the ability to write them quickly is what makes them so helpful, and an abstraction library could take that advantage away.
3. Sometimes a regex is faster than the overhead of a parser, so wouldn't the choice be dependent on context? In other words, regexes are not always slower, true?
4. Wouldn't some abstraction libraries utilize regexes under the hood? Would that be wrong in your view?
P.S. Some languages allow the option for very readable regexes, e.g. separate each component on its own line, with a comment.
Ah, understood. I was thinking "valid == well-formed" without knowing whether it really works (i.e., could be deleted), whereas I see you rightfully point out that it more reasonably means "it works." Thank you, makes sense.
Simply, some sites don't enforce the full set of RFC rules; as a result, some people actually have non-RFC-compliant email addresses that are valid.
How can you 'compile' a regular expression?
For very simple regular expressions they can be decently fast, but as soon as you pull out the more complicated expressions needed for parsing, they get slower. Even simple repeats can have a lot of overhead if not used correctly; see "Looking Inside The Regex Engine" at http://www.regular-expressions.info/repeat.html. An equivalent parser doesn't need to do any backtracking and doesn't care about surface formatting. For example, I've seen an application use regular expressions for HTML parsing. After spending a while figuring out what they actually did, I found the source HTML had changed its whitespace, but not its DOM structure, which broke the regular expressions.
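To make the repeat-overhead point concrete, here's a minimal demonstration of catastrophic backtracking (exact timings will vary by engine and machine):

    import re
    import time

    # Nested quantifiers force the engine to try exponentially many ways of
    # splitting the input before it can report failure.
    pattern = re.compile(r"(a+)+b")
    subject = "a" * 26  # no trailing "b", so every split must be tried

    start = time.time()
    pattern.match(subject)  # returns None, but only after exponential backtracking
    print("took %.2fs for %d characters" % (time.time() - start, len(subject)))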
As for my reasoning above, I think a lot of 'abstraction' libraries would be faster by operating directly on the data, instead of just converting it to regular expressions. The beauty of regular expressions is the speed at which they can be written.
You lose the ability to write them quickly, which, considering I mainly use them for code editing or one-off tasks, is quite important. The thing about regular expressions is that they aren't suited to general programming in their normal form (they are hard to read, and slow); the quoted example is better described as a general parser, and could be implemented much better directly than by abstracting through regular expressions.
Cool, I love having stuff like this out there. It may not always be the right tool for a given problem, but sometimes it will be. I imagine I'll have cause to use it sometime this year, so thanks!
In particular, I could see this being useful in a situation where you want the validation "rule" to be readable by non-programmers (perhaps if building a system where an end-user can choose amongst a set of canned validation rules for custom fields that they can add to something-or-other). In such a case, this simple grammar is something you could display in whatever "configurator" UI the user has. In general I like stuff that both programmers and non-programmers (or just a programmer that doesn't feel like being a programmer right then) can understand.
Regexes... fun to figure out and write the first time. A pain to debug and support x number of months in the future, especially for other developers, on anything more than the common cases.
There are some things that can help with this:
0. Consider whether string.split could do part of the job instead.
1. Consider writing unit tests for your regexes.
2. Split your regex onto multiple lines.
3. Turn on the option that lets you include comments inside your regex (see the sketch below).
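In Python, points 2 and 3 are both covered by re.VERBOSE; a minimal sketch:

    import re

    # re.VERBOSE ignores insignificant whitespace in the pattern and allows
    # "#" comments, so each component can be documented in place.
    date = re.compile(r"""
        (?P<year>\d{4})  -    # four-digit year
        (?P<month>\d{2}) -    # two-digit month
        (?P<day>\d{2})        # two-digit day
    """, re.VERBOSE)

    print(date.match("2012-10-04").groupdict())
    # {'year': '2012', 'month': '10', 'day': '04'}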
I do agree that really complicated regexes get to be a pain.
I try to assuage my pain and that of my devs by actually commenting the regex. The comment explaining things is often longer than the regex itself, but it helps quite a bit.
I'm not at all convinced by this, I'm afraid. Regexes are intimidating at first, but when you get the hang of them they are a fabulously powerful tool that absolutely every programmer should learn.
As for 'laziness', I don't buy that regexes take any longer to write than the linked method. The third example can be done (in Python, but I'm sure the Ruby equivalent is just as simple) with:
re.sub(r'from\s+(\w+)\s+(\w+)\s+to\s+(\w+)\s+(\w+)', r"{'from': '\1 \2', 'to':'\3 \4'}", "from 4th October to 10th October")
Which is only ten characters longer, requires no external dependencies and can be used with very little modification in (almost) every other language.
Regexes have one huge flaw: the syntax. If they had whitespace in them (like CoffeeScript), and required quoting the characters you were looking for (to distinguish from syntax), then paired with a little syntax highlighting they would be far easier to read and write.
Ruby (borrowing from Perl, and common in many regex implementations, I believe) has the "x" modifier to ignore whitespace and comments in literal regexes, for example:
str = "2012-10-04T07:20:00"
str =~ /
  (\d{4})  # four digit year
  -
  (\d{2})  # two digit month
  -
  (\d{2})  # two digit day
  T
  (\d{2})  # two digit hour
  :
  (\d{2})  # two digit min
  :
  (\d{2})  # two digit sec
/x
Here's something equivalent to the example in the OP:
#!/usr/bin/env perl
use Template::Extract;
use Data::Printer;

sub extractValues {
    my ($document, $template, $opts) = @_;

    my $delims = $opts->{delimiters} || [ qw/ { } / ];

    $document = lc $document
        if $opts->{lowercase};

    if (exists $opts->{whitespace}) {
        my $ws = ' ' x $opts->{whitespace};
        $document =~ s/\s+/$ws/g;
    }

    Template::Extract->new({ START_TAG => $delims->[0], END_TAG => $delims->[1] })
                     ->extract($template, $document);
}

p extractValues('/2012/08/12/test.html', '/{year}/{month}/{day}/{title}.html');
p extractValues('John Doe <john@example.com> (http://example.com)', '{name} <{email}> ({url})');
p extractValues('from 4th October to 10th October',
                'from `from` to `to`',
                { whitespace => 1, delimiters => ['`', '`'] });
Output:
\ {
    day      12,
    month    08,
    title    "test",
    year     2012
}
\ {
    email    "john@example.com",
    name     "John Doe",
    url      "http://example.com"
}
\ {
    from     "4th October",
    to       "10th October"
}
eh. I use regular expressions often enough that I am pretty fast at them for common cases, and usually get them right on the first try. This seems like it would only be useful if you don't really use regexes that often.
As for the case of users, I don't think I would trust most users (as the article suggests) to write their own pattern matching.
Quite right. I see two issues with regexes: funky syntax and some important and non-obvious limitations. If you want to use them, these two things add up to a considerable barrier to entry. There are many examples of trivial pattern-matching languages (DOS wildcards, UNIX wildcards and SQL text matching to name but three) that address both of these issues by being easy to grasp and surprisingly useful.
I haven't actually tried this, but it seems easy to grasp and I've spent enough time busting up delimited strings that "surprisingly useful" is plausible too.
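As a sketch of how small such a language can be, Python ships UNIX-style wildcards in the standard fnmatch module:

    import fnmatch

    # The entire pattern language is *, ?, [seq], and [!seq].
    print(fnmatch.fnmatch("2012-08-12-test.html", "*.html"))  # True
    print(fnmatch.fnmatch("notes.txt", "????-??-??-*.html"))  # False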
Congratulations, you've implemented a kind of crappy pseudo-CFG parser.
See SimpleParse (1) for an example of a recursive descent parser implemented in Python accepting EBNF grammars (2).
I was talking with my friend at work today about the fact that most languages don't seem to have a simple library for parsing data so people end up relying on Regexes to get all their data extraction from strings done. He knows much, much more about parsers than I do though and told me to shut up and build one myself. That gave me pause.
Your comment is the reason why I detest the community on HN. If I had put my personal time into this and then read such a comment, I would find it extremely demotivating. The solution may not be perfect, but it would be much more helpful if you offered constructive criticism instead of this.
I was noting this yesterday in the DollarDollarJS thread ... presumably in the quest for "karma," flaming open-source projects has really gotten popular over the past eight months or so on this website (I think that flaming of startup ideas has always been a feature, however).
> most languages don't seem to have a simple library for parsing data
This is what lex and yacc (etc.) are for, but they aren't exactly simple. Arguably, XML has largely replaced these for "parsing data", since DTDs and XML Schema are CFGs with some minor restrictions (e.g. 1-unambiguous). In many cases you don't even need a CFG; nested lists are enough, so JSON fills that role. For even simpler data - just lists of values - there's CSV and unix-style line-based formats (use awk etc. for parsing).
Interest in CFGs seems to have decreased dramatically since XML (and later JSON) met the simpler needs they used to serve.
Or a simple RFC6570 parser (URI Template [1]) that supports variable extraction. Variable extraction isn't part of the spec [2], but at least one library implements it [3].