

A Liberal, Accurate Regex Pattern for Matching URLs - prakash
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

======
bootload

      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
    

I found the IETF spec (RFC 3986, appendix B) suitable for my needs ~
<http://www.ietf.org/rfc/rfc3986.txt> a nice regex to parse URLs. This
regex allows you to extract the scheme ($2), authority ($4), path ($5), query ($7)
and fragment ($9) ~ <http://www.flickr.com/photos/bootload/238916518/>
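
That pattern at work in Python (a sketch; the query and fragment tacked onto the flickr URL are made up for illustration):

```python
import re

# RFC 3986 appendix B parsing regex; groups 2/4/5/7/9 hold the
# scheme, authority, path, query and fragment.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

m = URI_RE.match("http://www.flickr.com/photos/bootload/238916518/?x=1#top")
print(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))
# http www.flickr.com /photos/bootload/238916518/ x=1 top
```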

I've seen problems with taking a regex string and expecting it to work in all
cases on all regex engines, which is why I tend to stick with PCRE
~
[http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Express...](http://en.wikipedia.org/wiki/Perl_Compatible_Regular_Expressions)
a point in favour of the Gruber example.

 _"... The pattern is also liberal about Unicode glyphs within the URL ..."_

PCRE supports Unicode but it's not switched on by default ~
<http://www.pcre.org/pcre.txt>

------
nanotone
On the penultimate paragraph, and somewhat of a tangent:

Wow, I'd completely forgotten that you could have Unicode in domain names, and
I suspect a lot of people don't think about it very much either. In my limited
experience, even Chinese-only websites rarely stray from normal alphanumeric
domains, even though the people visiting those sites could easily type out
URLs with Chinese glyphs.

Perhaps I'm missing something here, but it seems that with good alphanumeric
domains becoming less available, cool/clever/classy Unicode domains could be a
viable alternative, given an appropriate purpose -- Google would probably not
want one -- and a techie enough audience. When [for which sites?] and how
often do people actually type URLs?

Example: a friend of mine did a cheeky web branding project a while ago named
"Heart Star Heart"... ♥★♥.com would have been perfect.

EDIT: I should probably do more research on this myself, but it looks like
there's some mysterious isomorphism between Unicode domains and "normal"
domains. Firefox renders U+272A in <http://✪df.ws/e7m> correctly but changes
its text to <http://xn--df-oiy.ws/e7m> and when I access ♥★♥.com my ISP
complains that xn--p3hxmb.com doesn't exist. Anybody know what the isomorphism
actually is?

~~~
sorbits
URLs do not allow Unicode, but most user agents support
<http://en.wikipedia.org/wiki/Internationalized_domain_name>

That said, non-ASCII URLs suck because not everyone can type them. Imagine
being a tourist in Tokyo who has to lookup a restaurant on your laptop or
having to lookup the product page for this gadget you bought in China…
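
The mapping asked about above is Punycode (RFC 3492): IDN encodes each non-ASCII label to ASCII and prefixes it with xn--. Python ships a punycode codec, so the two domains from the thread can be checked directly (a sketch only; real IDNA also normalizes the label before encoding, which is skipped here):

```python
# Punycode (RFC 3492) is the per-label encoding behind IDN.
for label in ("✪df", "♥★♥"):
    ascii_label = "xn--" + label.encode("punycode").decode("ascii")
    print(label, "->", ascii_label)
# ✪df -> xn--df-oiy
# ♥★♥ -> xn--p3hxmb
```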

~~~
nanotone
Right. As I noted above, it's absolutely not acceptable for some situations,
particularly those where you want lost or confused people to look you up. But
I can still think of plenty of other situations, and was merely pointing out
the disparity between the number of Unicode URLs I've encountered and the
number I'd expect to have encountered, given all the possibilities.

------
ehsanul
I was using a monster of a regex for validating URLs in a Ruby (Sinatra) app,
and it wasn't even looking in unstructured text. Found it at
[http://snipplr.com/view/6889/regular-expressions-for-uri-val...](http://snipplr.com/view/6889/regular-expressions-for-uri-validationparsing/)

    
    
      /^(https?):\/\/((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?((?:\/(?:[a-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)(?:\?((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?(?:#((?:[a-z0-9\-._~!$&'()*+,;=:\/?@]|%[0-9A-F]{2})*))?$/i
    

Yeah, more involved. But it parses the URL into its parts, and it does work.
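
For the curious, the pattern drops into Python more or less unchanged (a sketch; the Ruby-style `\/` escapes become plain `/` and the `/i` flag becomes `re.IGNORECASE`):

```python
import re

# The snipplr URL-validation pattern, transcribed to Python. Groups:
# 1 scheme, 2 host, 3 port, 4 path, 5 query, 6 fragment.
URL_RE = re.compile(
    r"^(https?)://((?:[a-z0-9.\-]|%[0-9A-F]{2}){3,})(?::(\d+))?"
    r"((?:/(?:[a-z0-9\-._~!$&'()*+,;=:@]|%[0-9A-F]{2})*)*)"
    r"(?:\?((?:[a-z0-9\-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?"
    r"(?:#((?:[a-z0-9\-._~!$&'()*+,;=:/?@]|%[0-9A-F]{2})*))?$",
    re.IGNORECASE,
)

m = URL_RE.match("https://example.com:8080/a/b?q=1#frag")
print(m.groups())
# ('https', 'example.com', '8080', '/a/b', 'q=1', 'frag')
```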

~~~
wingo
It looks like line noise. Funny, though: I can read regexps better than I can
read formal semantics. Having tried once this evening to read
<http://matt.might.net/papers/might2007diss.pdf>, regexps are refreshing :)

~~~
mahmud
Please do not link to random papers on semantics, you don't know whose weekend
you might ruin.

Horrible nerd-sniping there; I grok half of the paper, and now have no choice
but to study the formal semantics of the other half :-(

------
durin42
Adium also has a pretty insane way of matching URLs that's been in use (and
growing all the while) since 2004.

[http://hg.adium.im/adium-1.4/file/542aa252713b/Frameworks/Au...](http://hg.adium.im/adium-1.4/file/542aa252713b/Frameworks/AutoHyperlinks%20Framework/Source/AHLinkLexer.l)

is the lexing part, and then there are other files in the same directory that
do other little bits. The whole hyperlinks framework is under a BSD license.

------
philfreo
I generally dislike when periods are taken as part of the URL. Then you can't
end a sentence with a URL like <http://example.com>.

HN does it right, but Gruber's example seems to put the period in the URL.

~~~
tsetse-fly
In the case of

<http://en.wikipedia.org/wiki/O3b_Networks,_Ltd>.

HN does it wrong. There are exceptions either way, I wouldn't say that one is
more correct.

------
notauser
Great stuff, my URL matching regex was very limited.

For e-mails I use:

    
    
      /\b[-a-z0-9~!$%^&*_=+}{\'?]+(\.[-a-z0-9~!$%^&*_=+}{\'?]+)*@([a-z0-9]([-a-z0-9_]?[a-z0-9])*(\.[-a-z0-9_]+)*\.(aero|arpa|biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|pro|travel|mobi|[a-z]{2})|([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})(\.([1]?\d{1,2}|2[0-4]{1}\d{1}|25[0-5]{1})){3})(:[0-9]{1,5})?\b/ig
    

And for twitter user names I use:

    
    
      /\B@\w+\b/ig
    

(which incorrectly matches @@username as @username but I assume that kind of
thing is a typo - the important thing is not to match e-mail addresses)
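
A quick check of the Twitter pattern in Python (the `/ig` flags become `re.IGNORECASE`; `findall` stands in for `/g`):

```python
import re

# \B requires a non-boundary before the @, so "bob@example.com" is
# skipped while a freestanding "@alice" matches.
TWITTER_RE = re.compile(r"\B@\w+\b", re.IGNORECASE)

print(TWITTER_RE.findall("ping @alice, not bob@example.com"))
# ['@alice']
print(TWITTER_RE.findall("@@username"))
# ['@username']
```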

~~~
jerf
For validating emails, I've settled on /.@./, or if you really want to push
for valid emails, /.@[^.]+\../. (Note the lack of anchoring to the beginning
or end.) (That, and some limit on length.)

The rules are so flipping complicated and so easy to get wrong that you're
better off just trying to send a mail and seeing what happens, and asking the
recipient to validate reception if you care about the address. Is it really
_that_ important to exclude bad emails, at the cost of, say, blocking email
addresses from the UK, as your regex seems to do? Even "validating" for sheer
user error is only useful if you get it _right_.
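
A sketch of those two checks in Python (unanchored, so `re.search` rather than `re.match`):

```python
import re

# Minimal checks: "something, an @, something" and, stricter, an @
# followed eventually by a dot with something after it.
LOOSE = re.compile(r".@.")
STRICTER = re.compile(r".@[^.]+\..")

print(bool(LOOSE.search("a@b")))        # True
print(bool(STRICTER.search("a@b")))     # False: no dot after the @
print(bool(STRICTER.search("a@b.co")))  # True
```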

~~~
cschneid
I like soft validation for emails. "This doesn't look like an email address,
verify you typed everything correctly, and resubmit". That way you handle
legit typos, without hassling people who have weird emails (gmail plus signs
and such).

------
kprobst
What would be the equivalent in Python to the [:punct:] character class? I
don't think the re module supports POSIX classes. I guess they'd have to be
spelled out pretty much?

~~~
kingkilr
I guess that would be re.escape(string.punctuation), I've never looked/thought
about it though.
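
That suggestion works out to something like this (note that string.punctuation is ASCII-only, so it matches POSIX [:punct:] only for the C locale):

```python
import re
import string

# Build a character class equivalent to ASCII [:punct:] by escaping
# string.punctuation and wrapping it in brackets.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")

print(PUNCT_RE.findall("a-b, c!"))
# ['-', ',', '!']
```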

~~~
kprobst
Figured as much after I played around with making this work. Thanks!

------
techiferous
Wow, I just had this very problem a few days ago for an entry I submitted to
CodeRack! <http://coderack.org/users/techiferous/entries/90-racklinkify>

(Note: you can't plug-n-play this middleware yet--still a coupla bugs. Will
fix soon.)

------
whalesalad
It doesn't work with standard permalinks that feature hyphens in the URL, and
none of his examples show links with hyphens. Most blogs out there (WordPress)
use hyphens in their permalinks.

------
DanBlake
This is great. The regex we use on tinychat for URLs is self-made and not as
all-inclusive as this.

