
Regex in Swift - hodgesmr
http://benscheirman.com/2014/06/regex-in-swift/
======
AaronFriel
There is probably a good reason Swift doesn't have regex literals, regex
operators, and other such things. These things are _not common_ in statically,
strongly typed languages with an emphasis on safety.

That could be pure correlation. Perhaps it is just coincidence that
JavaScript, PHP, Perl, and a handful of others happen to have a lot of
"stringly typed" code, message passing using strings as data structures, and
an emphasis on string and array operations.

Or it could be that such features, through their ease of use and the Turing
tarpit of powerful enhanced regular expression languages, lure developers into
the trap of stringly typed code, and that safe languages are the ones that
don't make it exceedingly easy to use regular expressions. After all, there is
no reason you cannot use regular expressions in safer languages like Java, C#,
Go, Rust, Haskell, or, to be charitable, C++. They just aren't first-class
citizens in those languages.

I'm of the view that if your language makes error-prone idioms and bad
practices easy, then developers will be prone to error and bad practices.

tl;dr: Even as someone not invested in iOS or OS X development at all, given
the chance I'd veto these features. Give me types, not strings.

~~~
SiVal
Having amazing string capabilities is more important now than ever. But that
actually argues against including regex in the language syntax itself. You
want regular expressions to be able to evolve to become ever more powerful and
useful.

So just include a literal string type in the language itself--one that
minimizes the need for escapes and can be used for all sorts of protocols--and
use a regex library. The syntaxes of the library and the language can then
evolve somewhat independently, and if it reaches a point where you need a non-
backwards-compatible change in the regular expression syntax for some great
new innovation, people who still need the old one can continue to import the
old library, while others can import the new one.
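
A minimal sketch of that division of labor, in Swift: the language supplies only a raw string literal (the `#"..."#` form, available from Swift 5 onward), and the regex engine comes from a library, here Foundation. The `isDate` helper and the date pattern are illustrative assumptions.

```swift
import Foundation

// The same pattern written twice: once with ordinary escaping,
// once as a raw string literal where backslashes need no doubling.
let escapedPattern = "^\\d{4}-\\d{2}-\\d{2}$"
let rawPattern = #"^\d{4}-\d{2}-\d{2}$"#

// Hypothetical helper: the regex engine lives in the library (Foundation);
// the language only has to provide a convenient string literal.
func isDate(_ text: String) -> Bool {
    return text.range(of: rawPattern, options: .regularExpression) != nil
}

print(escapedPattern == rawPattern)
print(isDate("2014-06-05"))
```

Both literals produce the same string, so swapping the regex library later would not require any change to the language syntax.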

The freedom of technologies to work together yet evolve independently is an
important "feature" that's worth protecting.

~~~
danudey
Aside from these reasons, you also have the case of performance optimization.
In Perl, for example, pretty much all string parsing (that I've ever seen done
in Perl code) is done via regular expressions. Perl's regular expressions are
so prevalent that most software I've used that supports regular expressions
uses the Perl-compatible regular expression library (libpcre).

The issue is that if you provide developers with a simple method of e.g.
splitting a string using regular expressions, then they will always split
their strings with regular expressions. This is rarely the most efficient way
of doing it, however, and it requires more memory and more overhead than e.g.
splitting a string by simply scanning it.
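
The two approaches being contrasted can be sketched in Swift as follows. `regexSplit` is a hypothetical helper built on Foundation's `NSRegularExpression`; the sample input is made up, and the sketch makes no claim about the actual cost difference, it only shows how much more machinery the regex route involves.

```swift
import Foundation

let line = "alpha,beta,gamma"

// Plain scan over the characters; no regex engine is involved.
let scanned = line.split(separator: ",", omittingEmptySubsequences: false)
    .map(String.init)

// Hypothetical regex-based split: compile a pattern, collect every match,
// then slice out the text between consecutive matches.
func regexSplit(_ text: String, pattern: String) throws -> [String] {
    let regex = try NSRegularExpression(pattern: pattern, options: [])
    let source = NSString(string: text)
    let whole = NSRange(location: 0, length: source.length)
    var pieces: [String] = []
    var start = 0
    for match in regex.matches(in: text, options: [], range: whole) {
        let gap = NSRange(location: start, length: match.range.location - start)
        pieces.append(source.substring(with: gap))
        start = match.range.location + match.range.length
    }
    pieces.append(source.substring(from: start))
    return pieces
}

let viaRegex = (try? regexSplit(line, pattern: ",")) ?? []
print(scanned == viaRegex)
```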

The reason this is a problem for Swift in particular is mobile devices, where
memory and CPU use are more costly than on the desktop.

I don't think it's coincidence that all the languages I'm aware of which
natively support regexes as part of the language syntax are
interpreted/scripting languages where performance is not the language's
primary concern (Python being one such language where this syntax is notably
absent), whereas the languages that the grandparent comment listed as 'safer
languages', which do not have regex literal support ("Java, C#, Go, Rust,
Haskell, or to be charitable, C++"), are all compiled languages where
performance is assumed to be a primary concern for the language design and for
developers using the language.

~~~
benniw
"more memory and more overhead than e.g. splitting a string by simply scanning
it."

Regexes _are_ simply scanning strings, they're just a more condensed syntax
for specifying how the scanning should be done.

And in interpreted languages, using the built-in regex feature (which can
apply high-level optimizations) will virtually always give much _better_
performance than implementing the same logic by manually looping over the
string character-by-character with the language's interpreted for/while loops.

------
kenferry
The author is maybe not aware that there's already a convenience form that
doesn't require explicitly making an NSRegularExpression object.

    
    
        if name.rangeOfString("ski$", options: .RegularExpressionSearch).location != NSNotFound {
            println("\(name) is probably Polish")
        }
    

That's existing Cocoa API; in Swift (hopefully!) the API can be updated to
return nil if there's no match, so that it can read

    
    
        if let match = name.rangeOfString("ski$", options: .RegularExpressionSearch) {
            println("\(name) is probably Polish")
        }
    

which I don't think is too bad!
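
For reference, later Swift versions expose this Foundation call as `range(of:options:)`, which returns an optional range and so supports the `if let` form directly. A minimal sketch, assuming a current Swift toolchain with Foundation; the `looksPolish` wrapper is a hypothetical name.

```swift
import Foundation

// Hypothetical wrapper around the convenience API: no explicit
// NSRegularExpression object is created by the caller.
func looksPolish(_ name: String) -> Bool {
    // range(of:options:) returns nil when the pattern does not match.
    if let match = name.range(of: "ski$", options: .regularExpression) {
        print("\(name) is probably Polish (matched \"\(name[match])\")")
        return true
    }
    return false
}

_ = looksPolish("Kowalski")
_ = looksPolish("Skinner")
```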

~~~
subdigital
Thanks, included this in the post.

------
eridius
This feature is crying out for procedural macro support, not for being built
into the language. For comparison, Rust has compile-time regular expressions
(which, I will note, this blog post does not do; it's all runtime-parsed),
implemented as a separate library that ships with Rust, using the procedural
macro support (also called syntax extensions). This means the compiler and the
language spec know _nothing_ about regular expressions; only the library
libregex knows anything about them, and if you don't link against libregex,
your program has no knowledge of them.

This ends up looking like the following:

    
    
        #![feature(phase)] // feature-gate for syntax extensions
        #[phase(plugin)] // tells the compiler the following crate has syntax extensions
        extern crate regex_macros; // a crate is a rust library. this one provides the syntax extension
        extern crate regex; // this one provides the runtime support for regular expressions
    
        fn main() {
            let re = regex!(r"^\d{4}-\d{2}-\d{2}$"); // compile-time regular expression
            assert_eq!(re.is_match("2014-01-01"), true);
        }
    

That `regex!(...)` call will trigger the compile-time syntax extension to
parse the regular expression, throw a compile-time error if the parsing fails,
and otherwise expand to an inline data structure that contains the runtime
representation of the parsed regular expression. Even better, it generates
native Rust code for various bits of the matching process, instead of relying
on the generalized implementation used for runtime-parsed regular expressions,
which means it's actually faster to use a compile-time regex. The downside is,
of course, that it's generating specialized code for each one, so this can
bloat your binary if you use a lot of regular expressions, but on the upside
turning on Link-Time Optimization can get rid of a lot of this overhead.
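
For comparison on the Swift side: Swift 5.7 and later added regex literals that the compiler parses at build time, so a malformed pattern is rejected before the program runs. A minimal sketch, assuming a Swift 5.7+ toolchain (and macOS 13 / iOS 16 or later on Apple platforms); `isISODate` is a hypothetical helper. This is a different mechanism from the Rust syntax extension described above, and the sketch says nothing about how the matching code is generated.

```swift
// Hypothetical helper using a regex literal. The #/.../# delimiters are
// accepted without extra compiler flags; the pattern is checked at
// compile time, not when the function first runs.
func isISODate(_ text: String) -> Bool {
    let pattern = #/^\d{4}-\d{2}-\d{2}$/#
    return text.wholeMatch(of: pattern) != nil
}

print(isISODate("2014-01-01"))
print(isISODate("2014-1-1"))
```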

------
coldtea
From the radar issue submission:

> _Any modern language should natively support regular expression literals_

Regex literals add needless complexity to the language, and tie it to a
specific regex implementation, with no real benefit.

Just because Perl/JS/Ruby have this kludge, doesn't mean a modern language
"should" have it.

Now, a way to write unescaped strings (e.g. not having to escape all the regex
operators like \ etc.), that I can get behind.

~~~
benniw
There _are_ real benefits, they're called convenience and compatibility.

And what's wrong with tying a language to a specific regex syntax? After all,
you're also tying it to a specific outside-of-regexes syntax.

Regexes are also code, they just happen to be written in a different
sub-language than the rest of the program.

~~~
coldtea
> _There are real benefits, they're called convenience and compatibility._

Besides the fact that built-in regex literals are hardly any more "convenient"
than a function call (a few keystrokes saved at best), regexes shouldn't be
"convenient".

If anything, they should be discouraged. To quote JWZ: "Some people, when
confronted with a problem, think 'I know, I'll use regular expressions'. Now
they have two problems."

Oh, and "compatibility" doesn't come into play at all. Why would regex literals
be any more "compatible" than a regex object/functions? Compatible with what?
JS and Ruby syntax?

> _And what's wrong with tying a language to a specific regex syntax? After
> all, you're also tying it to a specific outside-of-regexes syntax._

For one, because you have to keep maintaining it forever, as part of the core
syntax, whereas with a library you could deprecate it. It's not like there
aren't several regex flavors, and approaches to executing them (e.g.:
[http://swtch.com/~rsc/regexp/regexp1.html](http://swtch.com/~rsc/regexp/regexp1.html)
).

~~~
benniw
Re "compatibility", I was thinking more about the language's library
ecosystem, that could benefit from a single official regex implementation that
can be relied upon and can be assumed to be known by programmers (when
designing library APIs, for instance).

But I actually forgot to mention the biggest benefit of having dedicated regex
literals (rather than just passing them as strings to some constructor at run-
time):

They will be treated as part of the language, and can be parsed and compiled
together with the rest of the code (a.k.a. at compile-time).

Which means:

1) Better performance

2) Syntax errors in regexes are compile-time errors, which is a _huge_ help in
keeping bugs out

3) Editors will show you syntax highlighting inside of regexes

It's hard to overstate the benefit of those three consequences. Once you've
worked with a language that supports such compile-time regex literals (e.g.
Perl) you don't want to go back to using regexes in a language that treats
them as strings (e.g. Python).

Regexes are code, and defining code in strings to be eval'ed at run-time is
just plain wrong.
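
The string-based failure mode can be sketched in Swift with Foundation's `NSRegularExpression`, whose initializer throws at run time when the pattern is malformed; the compiler sees only an ordinary string. `compiles` is a hypothetical helper name, and the patterns are made up for illustration.

```swift
import Foundation

// Hypothetical helper: reports whether a pattern string is a valid regex.
// The check can only happen when this code actually executes.
func compiles(_ pattern: String) -> Bool {
    do {
        _ = try NSRegularExpression(pattern: pattern, options: [])
        return true
    } catch {
        // A syntax error in the pattern surfaces here, at run time.
        return false
    }
}

print(compiles("ski$"))   // well-formed pattern
print(compiles("(ski$"))  // unbalanced parenthesis, still builds fine
```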

------
zemo
> once Apple reads my radar and implements /regex/ literal syntax

jesus, get over yourself.

~~~
mopsled
That seemed tongue-in-cheek to me.

------
abecedarius
How about something like
[http://www.inf.puc-rio.br/~roberto/lpeg/](http://www.inf.puc-rio.br/~roberto/lpeg/)?

Since the optional pattern syntax doesn't use backslashes, they aren't a
problem. I'd hack this up myself if Swift weren't Apple-only.

------
solomone
Custom operators for all the things! Or: how to make your code base
unmaintainable.

~~~
jonhohle
I've never understood this position. Why is a symbolic operator name so much
more difficult to maintain than a name restricted to [a-zA-Z0-9_]?

Over the last decade I've heard this repeated (and been stuck in languages
which don't support operator overloading), and by chance, in the very first
project I chose to implement in Swift, I found a valid use for operator
overloading (manipulating coordinates in a simple 3D renderer).

    
    
        foo = foo.add(bar)
        baz = baz.div(car)
    

vs.

    
    
        foo = foo + bar
        baz = baz / car
    

or even better:

    
    
        foo += bar
        baz /= car
    

Unrelated to operator overloading, but porting some Objective-C 1.0 code to
Swift resulted in 50% fewer lines of code (and I'm guessing ⅓ the total
characters).

edit: bug ;)
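
A minimal sketch of what such overloads might look like in Swift. The `Vec3` type is hypothetical, and dividing by a scalar is an assumption about what `div` meant in the snippets above.

```swift
// Hypothetical 3D coordinate type for illustration.
struct Vec3: Equatable {
    var x: Double
    var y: Double
    var z: Double

    // Component-wise addition.
    static func + (lhs: Vec3, rhs: Vec3) -> Vec3 {
        return Vec3(x: lhs.x + rhs.x, y: lhs.y + rhs.y, z: lhs.z + rhs.z)
    }

    // Division by a scalar (assumed meaning of `div`).
    static func / (lhs: Vec3, rhs: Double) -> Vec3 {
        return Vec3(x: lhs.x / rhs, y: lhs.y / rhs, z: lhs.z / rhs)
    }

    // The compound forms are defined explicitly in terms of the above.
    static func += (lhs: inout Vec3, rhs: Vec3) { lhs = lhs + rhs }
    static func /= (lhs: inout Vec3, rhs: Double) { lhs = lhs / rhs }
}

var foo = Vec3(x: 1, y: 2, z: 3)
let bar = Vec3(x: 1, y: 1, z: 1)
foo += bar
foo /= 2
print(foo)
```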

~~~
sparkie
> I've never understood this position. Why is a symbolic operator name so much
> more difficult to maintain than a name restricted to [a-zA-Z0-9_]?

Several reasons:

It's not obvious what an operator does, whereas names are descriptive. Some
operators are "obvious", because they're firmly ingrained into our culture -
(+) for addition is an example, it's almost universally understood to mean
that. On the other hand, what about a tilde, ~, makes it obvious that you
meant to test whether a string matches a regular pattern? It's only obvious to
users of Perl and its descendants, because Larry Wall arbitrarily decided to
use it for such. If every programmer invented his own operators instead of using the
common languages we share, we would get nowhere.
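
To make the trade-off concrete, here is a sketch of a Perl-style match operator next to a named function in Swift. Both `=~` and `matches(_:pattern:)` are hypothetical definitions written for this illustration (assuming Foundation is available); they do the same thing, and only one of them says so in its name.

```swift
import Foundation

// Hypothetical custom operator modeled on Perl's =~.
infix operator =~ : ComparisonPrecedence

func =~ (text: String, pattern: String) -> Bool {
    return text.range(of: pattern, options: .regularExpression) != nil
}

// The same operation behind a descriptive, searchable name.
func matches(_ text: String, pattern: String) -> Bool {
    return text.range(of: pattern, options: .regularExpression) != nil
}

print("Kowalski" =~ "ski$")
print(matches("Kowalski", pattern: "ski$"))
```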

Indeed, operator overloads are really useful when you use them _the right way_,
as per your own example, but the flaw in allowing people to overload
operators arbitrarily is that they abuse them to mean something unexpected -
we all expect + to mean addition, but when it's used to concatenate two
strings, it's easy to get confused as to why the hell somebody thought that
was a good idea.

A big hurdle is even figuring out what an operator means if you've never
encountered it. Unless you have good IDE support to navigate to its
definition, or a specialized search engine for searching your language, then a
typical search is going to turn up naught. On the other hand, most search
engines understand [a-zA-Z]+.

We could argue that using "+" for string concatenation "is obvious", at least
to other programmers - but only because they've encountered such usage before.
We can't reasonably expect to learn, recall, and fluently read arbitrary
operators for any conceivable operation you can think of. Well, unless you're
a fan of Control.Lens.Operators
([http://hackage.haskell.org/package/lens-4.1.2/docs/Control-L...](http://hackage.haskell.org/package/lens-4.1.2/docs/Control-Lens-Operators.html)),
in which case nobody else is going to read your code.

~~~
kbenson
Like everything, it's a trade-off. Here's an example of something I've been
envious of Haskell about for a while:
[https://bitbucket.org/xnyhps/haskell-unittyped/wiki/Examples](https://bitbucket.org/xnyhps/haskell-unittyped/wiki/Examples)

There are times where it really does make stuff much more readable. It's about
finding and taking advantage of those times.

------
wyager
"Please tack on this random application-specific feature that I like"

Can we add support for overloading the comma operator? How about the mail
function from PHP?

~~~
x3ro
How are regular expressions "random" and "application-specific"? Admittedly I
do not have a source for this, but I'd assume that a large percentage of
programs employ regexes somewhere. Every major language has regex support,
often built-in [1].

Swift, among other things, aims to make things less verbose, right?

[1]:
[http://en.wikipedia.org/wiki/Comparison_of_regular_expressio...](http://en.wikipedia.org/wiki/Comparison_of_regular_expression_engines)

~~~
wyager
> Every major language has Regex support, often built-in

That's a very bold statement. My most commonly used languages (Haskell, Rust,
Go, C, and C++) do not have regex support built in. I don't think any of them
even have regex support in their standard library.

> I'd assume that a large percentage of programs employs Regexes somewhere

I think that would be incorrect. I've never used a regex in a production
program. In fact, the only time I ever use regexes is when I'm scraping data
or doing searches, and that certainly doesn't account for "a large percentage
of programs".

A large percentage of programs also use URL fetching, but that's not a good
reason to make URL fetching part of the core language.

~~~
plorkyeran
Rust, Go and C++ all have regex libraries as part of the standard library.

~~~
wyager
OK, cool. It's still not built into the language in any meaningful way. There
isn't any special syntax or rule-bending done for regexes.

