

An Improved Liberal, Accurate Regex Pattern for Matching URLs - blazamos
http://daringfireball.net/2010/07/improved_regex_for_matching_urls

======
silentbicycle
This is the point at which it's worth learning _actual_ parsing tools, rather
than just winging it with REs. REs are fine for tokenizing, but cannot handle
recursion, and quickly become clumsy for patterns made of distinct sub-
elements.

Once you sink deeper into that Turing tarpit, you end up with monstrosities
like this (<http://www.ex-parrot.com/~pdw/Mail-RFC822-Address.html>), an RE
for matching valid email addresses.
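To make the recursion point concrete, here's a quick Python sketch (my own illustration, not from any of the linked pages): a regex only handles as many levels of nesting as you hard-code into it, while a few lines of recursion handle any depth.

```python
import re

# One hard-coded level of nesting: this matches "(a(b)c)" only because the
# inner group was written in by hand; depth N would need N nested copies.
one_level = re.compile(r"\((?:[^()]|\([^()]*\))*\)")

def match_parens(s, i=0):
    """Recursive-descent matcher: return the index just past the balanced
    paren group starting at s[i], or None if it isn't balanced."""
    if i >= len(s) or s[i] != "(":
        return None
    i += 1
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_parens(s, i)
            if i is None:
                return None
        else:
            i += 1
    return i + 1 if i < len(s) and s[i] == ")" else None

print(bool(one_level.fullmatch("(a(b)c)")))     # True: one level is fine
print(bool(one_level.fullmatch("(a(b(c))d)")))  # False: two levels is not
print(match_parens("(a(b(c))d)"))               # 10: recursion handles any depth
```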

~~~
jacobolus
Except the whole point of this thing, as clearly explained in the article, is
for everyone with access to a regexp implementation to be able to reuse the
same few-line regexp, as a drop-in replacement that works better than the
shitty regexps they currently use to recognize URLs. (Examples of currently
bad matchers that might benefit from this code: Hacker News’s, Gmail’s)

When they get a tiny bit clumsy, but before they get so clumsy that adding a
bunch of parsing machinery is really worth the trouble, regexps are still the
best solution.
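For a sense of what such a drop-in looks like, here is a much-simplified sketch in Python, as an illustration of the approach only, not Gruber's actual pattern:

```python
import re

# A much-simplified sketch of a "liberal" URL matcher (NOT Gruber's actual
# pattern): require a scheme or a www-style prefix, eat non-space
# characters, and refuse to end on common trailing punctuation.
URL = re.compile(r"""(?ix)
    \b
    (?: https?:// | www\d{0,3}\. )   # scheme, or www/www1../www999 prefix
    [^\s<>]+                         # then anything but whitespace and <>
    [^\s<>.,;:!?'")\]]               # ...but don't end on punctuation
""")

text = "See http://example.com/foo, or www2.example.org/bar."
print(URL.findall(text))
# ['http://example.com/foo', 'www2.example.org/bar']
```

The trailing character class is what quietly drops the comma and period that belong to the sentence rather than the URL.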

* * *

Markdown, on the other hand, John Gruber’s more famous project, would be much
improved (especially for people interested in extending it) by having its
specification written in terms of a real grammar.

~~~
silentbicycle
The article mentions that doing it better would require lots of nonstandard
extensions, but not that it's struggling in the first place because it's the
wrong tool for the job. If more developers realized the limitations of using
just REs for these tasks, languages would be better integrated with actual
parsing tools.

I think that tools like LPEG
(<http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html>) are a step in the right
direction - it's a small Lua library* providing PEG-based parsing. It's very
powerful, but also scales down to use as a more efficient, more expressive
PCRE replacement. There are various trade-offs in using PEGs rather than
LALR(1), LL(1), etc., but the integration with the rest of the language is
very good. It _feels_ like using a better form of REs, rather than "adding a
bunch of parsing machinery".

* Though nothing about its design is Lua-specific. There's a paper which explains its underlying mechanism, and it's just a small C library (2258 loc for v. 0.9) - porting it wouldn't be that difficult.
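To give a feel for that combinator style, here is a toy Python sketch (hypothetical helpers of my own, nothing to do with LPEG's real API): parsers are plain functions from (text, position) to a new position or None, and they compose like ordinary values.

```python
# Parsers are functions (text, pos) -> new position, or None on failure.
def lit(s):
    return lambda t, i: i + len(s) if t.startswith(s, i) else None

def seq(*ps):                      # match each parser in order
    def p(t, i):
        for q in ps:
            i = q(t, i)
            if i is None:
                return None
        return i
    return p

def alt(*ps):                      # first alternative that matches wins
    def p(t, i):
        for q in ps:
            j = q(t, i)
            if j is not None:
                return j
        return None
    return p

def many1(chars):                  # one or more characters from a set
    def p(t, i):
        j = i
        while j < len(t) and t[j] in chars:
            j += 1
        return j if j > i else None
    return p

word = many1("abcdefghijklmnopqrstuvwxyz")
host = seq(word, lit("."), word)                    # e.g. "example.com"
url = seq(alt(lit("http://"), lit("www.")), host)

print(url("http://example.com", 0))  # 18 (consumed the whole string)
print(url("filename.txt", 0))        # None
```

Because the pieces are first-class values, grammars grow by composition instead of by splicing ever-longer pattern strings together, which is the "better form of REs" feeling.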

I agree with you about an actual markdown grammar, though it's useful enough
that I have a hard time complaining. (I particularly like Discount
(<http://www.pell.portland.or.us/~orc/Code/discount/>).)

------
njharman
Why bother matching balanced parens? Just match everything up to the first
non-encoded space.

These are valid URLs, no?

    
    
      example.com/(
      example.com/dkjflkj)sdkfj(/.
      example.com/.
      example.com//////:
      example.com/anycharacters_in_any_order_as_long_as_certain_ones_are_encoded
    

There's no way to tell whether trailing punctuation is part of a valid URL or
not. You can assume it isn't and chop it off, which should be correct
99.999% of the time. Similarly with surrounding braces, brackets, and parens:
if you see one at the start, assume the one at the end is not part of the URL.
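That heuristic fits in a few lines of Python (a hypothetical helper, not code from the article), including the 0.001% case where it misfires:

```python
def tidy(candidate):
    """Sketch of the chop-trailing-punctuation heuristic."""
    # assume trailing punctuation belongs to the sentence, not the URL
    candidate = candidate.rstrip(".,;:!?")
    # a lone unmatched ")" is probably closing the surrounding text
    if candidate.endswith(")") and candidate.count("(") < candidate.count(")"):
        candidate = candidate[:-1]
    return candidate

print(tidy("example.com/foo."))                       # example.com/foo
print(tidy("example.com/bar)"))                       # example.com/bar
print(tidy("en.wikipedia.org/wiki/Lisp_(language)"))  # balanced parens: kept
# The 0.001%: URLs that genuinely end in punctuation get clipped.
print(tidy("en.wikipedia.org/wiki/Example,_Inc."))    # "Inc." loses its dot
```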

~~~
drivebyacct2
Ironically, that solution doesn't work for lots of Wikipedia links that end
in a paren.

~~~
_delirium
There are actually a lot of Wikipedia articles that end with punctuation,
which I get bitten by at HN relatively frequently. For example, there are a bunch of
Supreme Court cases involving companies that end with an "L.L.C.", "Inc." or
"Co.", like: <http://en.wikipedia.org/wiki/Riegel_v._Medtronic,_Inc>.

------
santry
What's the purpose of matching "www.", "www1.", "www2." … "www999."? For
purposes of this regex, wouldn't the domain matching that follows be
sufficient?

~~~
mturmon
referring to

<http://daringfireball.net/misc/2010/07/url-matching-regex-test-data.text>

you can see he wants to match things like www.example.com, but not
filename.txt.

So "www." can also introduce a URL.
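A minimal illustration of why the prefix matters, assuming a pattern that accepts either a scheme or a www-style prefix (my guess at the intent, not the article's exact regex):

```python
import re

# Assumed pattern: accept a scheme or a www/www1../www999 prefix, so bare
# "word.word" tokens like filename.txt don't get swept up as URLs.
URL = re.compile(r"\b(?:https?://\S+|www\d{0,3}\.\S+)")

print(bool(URL.search("visit www.example.com today")))    # True
print(bool(URL.search("mirror at www2.example.com")))     # True
print(bool(URL.search("open filename.txt in an editor"))) # False
```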

------
eli
But what about my .museum domains?

