Although Rebol can be used for programming,
writing functions, and performing processes,
its greatest strength is the ability to
easily create domain-specific languages or
dialects.
— Carl Sassenrath [Rebol author]
There have been many efforts similar to this in many languages, but most of us seem happy to stick to the more succinct canonical form, supplemented with /x # comments when things get too hairy.
Generally, I find that if one's regexes are so complex that one needs visualizers or other aids in writing them, one doesn't have a regex problem, but a parsing problem. The method of parsing by recursive descent can often lead to much more understandable (if more verbose) "pattern matching".
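As a rough illustration of that point (a Python sketch of my own, not anything from the original comment): a recursive-descent parser is more verbose than the equivalent regex, but each step is a named, testable piece of code.

```python
# Minimal recursive-descent sketch: parse a dotted version string like
# "1.20.3" into its integer components, without a regex.

def parse_version(s):
    """Return a list of integer components, or None if s is malformed."""
    pos = 0

    def digits():
        # Consume a maximal run of digits; return it, or None if absent.
        nonlocal pos
        start = pos
        while pos < len(s) and s[pos].isdigit():
            pos += 1
        return s[start:pos] if pos > start else None

    parts = []
    d = digits()
    while d is not None:
        parts.append(int(d))
        if pos == len(s):
            return parts          # consumed everything: success
        if s[pos] != '.':
            return None           # unexpected character
        pos += 1                  # skip the dot, expect more digits
        d = digits()
    return None                   # missing digits (e.g. "1..2" or trailing dot)

print(parse_version("1.20.3"))    # [1, 20, 3]
print(parse_version("1..2"))      # None
```

The regex equivalent (`\d+(?:\.\d+)*` plus a split) is shorter, but the function above gives you an obvious place to hang error reporting and edge-case handling.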
The worst regexes I've had to write involved parsing the various IMDb data files, which seem to have been formatted specifically to make them as difficult to parse as possible. I hear MediaWiki syntax is similarly arcane and evil, but I've never tried to parse it (though last night I started writing some tools to deal with Wikipedia dumps, so I might end up in that corner). I'd really like to see different approaches to parsing really ugly formats that feature an exception to almost every single pattern you think you've found. I honestly think the regex is easiest...
Looks like Linq (from .Net/C#). Pretty sexy way to write Regular Expressions if you ask me.
I've "learned" regular expressions multiple times but it just never sticks, I have no idea why. It certainly doesn't help that there are several different incompatible syntaxes (so what I remember and think "should" work doesn't).
I'd prefer to write regexes in this style; however, I would pay attention to performance (not that regular expressions are high performance, but I wouldn't want to see a large performance loss either).
Regular expressions are high performance if you use automata-style (regular-language) regular expressions, which limits some of the features you can use.
Modern regular expression engines in a lot of languages actually go beyond the expressiveness of a regular language. This is what damages performance.
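To make that concrete (a Python example of my own, not from the comment): backreferences are one such beyond-regular feature, since no finite automaton can remember an arbitrary first word in order to match it again later.

```python
import re

# Backreferences go beyond regular languages: matching "the same word
# twice" requires unbounded memory of what the first word was, which a
# DFA's fixed state set cannot provide.
doubled = re.compile(r"\b(\w+) \1\b")

print(bool(doubled.search("hey hey")))      # True: repeated word
print(bool(doubled.search("hey there")))    # False: no repetition
```

Engines that support `\1` typically fall back to backtracking search, which is where the pathological worst cases come from.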
There is no reason why this would reduce performance... if it's not doing anything crazy.
If anything you're taking work away from it. You're building the tree directly here, whereas a parser would normally build a tree from the string. But since this is integrating into the language's RE library, I'm guessing it's writing that tree out as a string, which is then passed into the regular expression engine, to be turned into a tree again :)
I guess it depends on your definition of "high performance."
If a regular expression runs too often, even pre-compiled (as they should be), you'll want to replace them with code written in the native language. I've gone in and replaced a one-line search/replace written as a regex (compiled) with just a C-style for() loop over the wchar array, and had the memory usage drop by near 80% and performance increase by over 60%.
So high performance is all relative. However, regex isn't something I'd describe that way, even compiled. It is a nice way to write complex string-parsing code quickly, however.
If your regex is replaceable by a simple "find_substring" or equivalent, it's slow.
If your regex is complicated, it will probably beat any naive attempt to write it into conventional string processing, short of reimplementing regexs in the first place. Especially since in many languages, "conventional string processing" may involve the creation of lots of copies and sub-copies.
> you'll want to replace them with code written in the native language
Probably not true for Javascript (and other scripted languages) - matching regex uses native and highly optimized regex lib, which will usually be orders of magnitude faster than implementing this in the language.
A regular expression implemented as a DFA would literally be looping over the string, and a state transition table. I don't see how performance could be bad.
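A minimal sketch of that table-driven loop (my own illustration, in Python), using a hand-built DFA for the regular expression `(ab)*`:

```python
# Hand-built DFA for the regular expression (ab)* — literally a loop
# over the string and a state transition table.  State 0 is both the
# start state and the only accepting state; a missing table entry is
# the dead state.
TABLE = {
    (0, 'a'): 1,
    (1, 'b'): 0,
}

def matches(s):
    state = 0
    for ch in s:
        state = TABLE.get((state, ch), -1)   # -1 = dead state
        if state == -1:
            return False
    return state == 0                        # accept only in state 0

print(matches("abab"))   # True
print(matches("aba"))    # False
```

One table lookup per input character, no backtracking: that's the performance argument for automata-style engines.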
It is highly dependent on the regular expression engine you use, most don't use automata because of extra features.
Perl 6 unifies "regexes" and recursive descent grammars at the syntax level and then compiles them to a hybrid of DFAs, NFAs, and regular code as necessary. The idea is to maintain the simplicity of simple regexes and of parser combinators for the user but retain as much of the performance benefits of true DFA-able regular expressions as is possible.
At least, that's the theory. In practice, while the benefit of syntactic usability is available today, the Perl 6 rules engine is still very slow and it'll likely take years to optimize the heck out of this approach and really harvest the performance benefit.
If you're interested in something similar for .NET / C#, check out my Regextra library, specifically the Passphrase Regex Builder: https://github.com/amageed/Regextra
As the name suggests though, the focus was on passphrase criteria and it wasn't to produce a DSL for general regex building. The library also supports named templates and a few utility methods.
Performance is unaffected. This provides a fluent and verbose way of building a regular expression.
Users of the library then feed the built regular expression into their standard regular expression engine.
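For readers who haven't seen the fluent style: here is a deliberately tiny sketch in Python (the class and method names are invented for illustration, not any real library's API). It only assembles a pattern string, which is then handed to the standard `re` engine, so match performance is unchanged.

```python
import re

# Hypothetical minimal fluent regex builder.  Each method appends a
# pattern fragment and returns self so calls can be chained; build()
# produces an ordinary pattern string for the standard engine.
class Pattern:
    def __init__(self):
        self._parts = []

    def literal(self, text):
        self._parts.append(re.escape(text))
        return self

    def digits(self, at_least=1):
        self._parts.append(r"\d{%d,}" % at_least)
        return self

    def build(self):
        return "".join(self._parts)

pattern = Pattern().literal("order-").digits(at_least=4).build()
print(bool(re.fullmatch(pattern, "order-2023")))   # True
print(bool(re.fullmatch(pattern, "order-12")))     # False: needs 4+ digits
```

The builder runs once at construction time; everything after `build()` is the regular engine doing its normal work.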
This is why I dislike the design of Linq. The pattern of chaining function calls to implement a DSL is common enough that they should have employed a general solution, not just a wonky SQL-specific version.
LINQ isn't SQL-specific and does apply generally. It can be used against the standard .NET framework objects and collections. There are different LINQ focuses or flavors, and there are 2 ways to write queries. There's LINQ to Objects, LINQ to XML, LINQ to SQL (no longer actively maintained; nowadays Entity Framework is the Microsoft alternative), and you can write your own LINQ providers to target other purposes.
As for syntax, there's the fluent syntax (chained methods), and there's the query syntax which is syntactic sugar that gets compiled to the methods. The query syntax is probably the biggest reason people mistake LINQ for being SQL specific since it resembles SQL.
E.g.,
var results = SomeCollection.Where(c => c.SomeProperty < 10)
.Select(c => new { c.SomeProperty, c.OtherProperty });
The same thing in query syntax:
var results = from c in SomeCollection
where c.SomeProperty < 10
select new { c.SomeProperty, c.OtherProperty };
Then you can iterate over both the same way:
foreach (var result in results)
{
Console.WriteLine(result);
}
Thanks, this is a lot better than writing this (even if the formatting worked here):
```
(?xi)
\b
( # Capture 1: entire matched URL
(?:
[a-z][\w-]+: # URL protocol and colon
(?:
/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
# (Trying not to match e.g. "URI::Escape")
)
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)
```
Actually, most of the comments seem to imply that whoever wrote that doesn't fully understand regexp syntax -- or, worse, expects that whoever reads it will not.
/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or "%";
wait, I thought about it for a second and I see what you meant. You're not saying it's wrong, you're saying it's obvious.
I wasn't sure if it was obvious because I wasn't sure if {1,3} was supposed to be {1-3} and there was a mistake in the expression, or if there was some kind of unexpected error in the [a-z0-9%] expression.
Because even in this simple example, there is room for error.
```
(?xi)
\b
(                           # Capture 1: entire matched URL
  (?:
    [a-z][\w-]+:            # URL protocol and colon
    (?:
      /{1,3}                # 1-3 slashes
      |                     # or
      [a-z0-9%]             # Single letter or digit or '%'
                            # (Trying not to match e.g. "URI::Escape")
    )
    |                       # or
    www\d{0,3}[.]           # "www.", "www1.", "www2." … "www999."
    |                       # or
    [a-z0-9.\-]+[.][a-z]{2,4}/ # looks like domain name followed by a slash
  )
  (?:                       # One or more:
    [^\s()<>]+              # Run of non-space, non-()<>
    |                       # or
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
  )+
  (?:                       # End with:
    \(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
    |                       # or
    [^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
  )
)
```
That really is Hacker News' worst limitation. I understand if they want to limit what formatting is available, but the fact that basic listing is so clunky is annoying.
Regular expressions are a natural fit for construction of regular expressions.
Look, I know it takes a while, but once you get the hang of it, you won't need any crutches to write regular expressions. The only tool that's really needed is a way to rigorously test a regular expression to make sure it does what it needs to do and there are a ton of those around.
No, they're really not, as evidenced by all the quoting and meta-character nonsense you have to deal with. Sure, it's not too difficult to figure out, most of the time, but I think a solution that puts characters and logic on different quoting levels will almost always be better from an expressiveness standpoint (ignoring ecosystem issues).
This is usually borne by the string literal being used to express the regular expression literal syntax in many languages. Perl, for example, has a regular expression literal syntax that is part of the language proper (which has the added benefit that non-dynamic regular expressions can be checked for syntax at compile time). Python, in contrast, doesn't have a first-class regular expression literal, but makes it easier to deal with by prefixing the literal with r or R to create a "raw string" (which exists to avoid excessive backslash escaping). Some regular expression engines use % as the meta-character indicator, which is more compatible with C-style "escape sequences" in double-quoted strings.
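The Python raw-string point can be shown in two lines: the engine receives the identical pattern either way; the raw form just avoids doubling every backslash at the string-literal level.

```python
import re

# The same pattern written with and without a raw-string literal.
# Both string literals denote the exact same characters, so the regex
# engine sees identical input; only the source-level typing differs.
escaped = "\\d+\\.\\d+"
raw = r"\d+\.\d+"

print(escaped == raw)                              # True
print(re.fullmatch(raw, "3.14") is not None)       # True
```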
If you think characters and logic need to be on different quoting levels, you're not taking the right perspective on regular expressions. \d or \w are not an escaped d or w; they are their own atoms (or "the keywords of the language", if you will), distinct from the atoms that match the ASCII characters 0x64 and 0x77. The thing to remember with regular expressions is always the first lesson presented: (non-meta) characters match themselves; the regular expression /a/ matches the letter a. What's implied here, but rarely said, is that that's not really the letter a in there, but rather an expression that matches the letter a—it just so happens to also look like the thing it matches. This distinction is subtle, but important. This can also be made more evident by using the /x modifier if it's available to spread out the individual expressions (put space between the keywords).
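For a concrete look at that /x idea, here it is in Python, where the modifier is spelled re.VERBOSE: insignificant whitespace is ignored and # comments are allowed, which spreads the individual atoms out visually.

```python
import re

# re.VERBOSE is Python's /x: whitespace between atoms is ignored and
# # starts a comment, so each "keyword" of the pattern gets its own line.
pattern = re.compile(r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
    -       # separator
    \d{2}   # day
""", re.VERBOSE)

print(pattern.fullmatch("2024-05-01") is not None)   # True
print(pattern.fullmatch("2024/05/01") is not None)   # False
```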
The primary difference in regular expression languages is often how "logic", as you call it, is expressed. PCRE considers, for example, [ to be the character for opening a character class and \[ to match the byte 0x5b. Admittedly, this is confusing when switching engines because 1) not every character matches itself (the expression that matches a character and the character it matches are not visually the same) and 2) other RE engines have taken the opposite approach depending on if that engine was meant, by the author, to have more literal atoms or more logic in its most common use (that is, you save typing if you mean to match the byte 0x5b more frequently than if you mean to open a character class).
As for "quoting", you almost NEVER should be using things like PCRE's \Q…\E (or the quotemeta function) unless you're building regular expressions dynamically from user-input. quotemeta and friends are not readability tools, but safety tools.
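Python's equivalent of quotemeta is re.escape, and the same advice applies: it earns its keep exactly when untrusted text is spliced into a dynamically built pattern (a small illustration of mine):

```python
import re

# quotemeta-style escaping as a safety tool: user input spliced into a
# pattern must have its metacharacters neutralized, or "." and "(" in
# the input become regex operators instead of literal characters.
user_input = "3.14 (approx)"
pattern = re.compile(re.escape(user_input))

print(pattern.search("value is 3.14 (approx) here") is not None)   # True
print(pattern.search("value is 3x14 zapproxz here") is not None)   # False
```

Without re.escape, the unescaped `.` would happily match the `x`, which is the kind of silent mismatch the safety tool exists to prevent.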
I'm using the term "quoting" in the general sense of a marker that some sequence of symbols is being used as symbols, rather than for their semantic values.
My perspective on regular expressions is that of a student who was introduced to the formal version of REs not two weeks ago. In this formalism, there are basically strings and operators on these strings. We don't usually use quotes, but only because you can usually infer from context which bits are strings and which are one of the small set of operators. But when we need to match numbers with possible "+"es (the alternation operator) in front of them, out come the quotes.
In a typical programming language, we don't have the luxury of expecting the interpreter to infer things like that from context. Further, it's rather common to try to match things that would otherwise be used as metacharacters. This is exactly why quoting, in the general sense, was invented, so we can tell what's the program and what's the input.
Granted, most of my RE experience is in Python, where everything is just jammed in a string. There it's obvious that metacharacters and escapes are just a worse-is-better substitute for quasiquoting. Maybe it's different in Perl, but I'm skeptical. Strings matching themselves is cool. The problem is that it's cool enough to prevent you from realizing when you've taken the metaphor too far.
I agree with you. Every now and then I see mentions of "all-new-regex-builder" on the HN frontpage. What is up with regex and the desire to write wrappers upon wrappers on top of it?
I see regex like this: if you have to use it often enough, better to learn it as it is - it will be more helpful in the long run. If you don't use regex too often, then just google your question - there's a very high chance that somebody already wrote a regex for your problem or a similar one.
Only tools I ever use are regex testers (like regexr.com) when I need to make sure that pattern works correctly.
It's not a "crutch", it's an "alternative". Couching it in negative terms isn't really fair.
While I prefer writing regexes, a regex DSL isn't fundamentally better or worse, just different. In addition, it allows non-computer people to write, or at least specify, regexes in a way that makes more sense to non-developers.
Alternate representations of regexes aren't necessarily a crutch to avoid learning the normal syntax. S-expressions in particular could be useful for runtime manipulation or generation of patterns without the bother of string mangling. (I can't think of a reason to do so off-hand, but it's a nifty capability.)
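To sketch what that S-expression capability could look like (this is my own toy illustration; the operator names `seq`, `alt`, and `star` are invented), patterns can be built from nested tuples and compiled down to an ordinary pattern string, with no string mangling along the way:

```python
import re

# Toy compiler from an s-expression-like tuple form to a pattern string.
# Plain strings are literals (escaped); tuples are (operator, args...).
def compile_sexp(node):
    if isinstance(node, str):
        return re.escape(node)
    op, *args = node
    parts = [compile_sexp(a) for a in args]
    if op == 'seq':                              # concatenation
        return "".join(parts)
    if op == 'alt':                              # alternation
        return "(?:" + "|".join(parts) + ")"
    if op == 'star':                             # zero or more
        return "(?:" + parts[0] + ")*"
    raise ValueError("unknown operator: %r" % op)

# ('seq', "ab", ('star', ('alt', "c", "d"))): "ab" then any mix of c/d.
pattern = compile_sexp(('seq', 'ab', ('star', ('alt', 'c', 'd'))))
print(bool(re.fullmatch(pattern, "abcdc")))   # True
print(bool(re.fullmatch(pattern, "abe")))     # False
```

Because the input is a tree rather than a string, programs can rewrite or compose patterns structurally before ever producing regex syntax.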
Regexes exist to avoid cumbersome code like this, to make it less error-prone. Makes me sad to see so many upvotes.
I get that some people have a hard time understanding regexes with all the backtracking and greediness. Yes, the syntax is a bit complicated. Maybe a simplified, predictable default mode could help. But there is no problem with a DSL being used as an abstraction. In fact, we need more DSLs, for everything!
Some parse refs: http://en.wikibooks.org/wiki/REBOL_Programming/Language_Feat... | http://www.rebol.net/wiki/Parse_Project | http://www.rebol.com/r3/docs/concepts/parsing-summary.html