
In search of the perfect URL validation regex - lgmspb
http://mathiasbynens.be/demo/url-regex
======
to3m
If you're going to allow dotted IPs you should really allow 32-bit IPs too,
e.g., [http://0xadc229b7](http://0xadc229b7),
[http://2915183031](http://2915183031) and
[http://025560424667](http://025560424667). (The validity of this last one was
news to me, I must admit.)
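
For reference, a minimal Python sketch (an illustration, not something from
the page) of how those single-number forms map to a dotted quad under the
historical inet_aton() rules:

    def ipv4_from_legacy(s):
        # inet_aton(3) treats "0x..." as hex, a leading "0" as octal,
        # and anything else as decimal.
        if s.lower().startswith("0x"):
            n = int(s, 16)
        elif s.startswith("0") and len(s) > 1:
            n = int(s, 8)
        else:
            n = int(s, 10)
        return ".".join(str((n >> shift) & 0xFF) for shift in (24, 16, 8, 0))

    for host in ("0xadc229b7", "2915183031", "025560424667"):
        print(host, "->", ipv4_from_legacy(host))
    # All three print "... -> 173.194.41.183".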

~~~
zAy0LfpBZLC8mAC
None of those is a URI, so a URI validator most certainly should not accept
them. Just because browsers tend to understand them as a matter of historical
accident does not mean those are valid URIs, just as the tag soup that
browsers also tend to understand isn't valid HTML.

~~~
mathias
Exactly. The goal was to come up with a good regular expression to validate
URLs as user input. There’s no way I’d want to allow alternate IP address
notations.

------
TazeTSchnitzel
Why use a regex? It's much simpler to write a URL validator by hand, speaking
as someone who wrote a URL parser,[1] and fixed a bug in PHP's.[2]

Or, you know, use a robust existing validator or parser. Like PHP's, for
instance.

[1] [https://github.com/TazeTSchnitzel/Faucet-HTTP-Extension](https://github.com/TazeTSchnitzel/Faucet-HTTP-Extension) - granted, this deliberately limits the space of URLs it can parse, but it's not difficult to cover all valid cases if you need to

[2] [https://github.com/php/php-src/commit/36b88d77f2a9d0ac74692a...](https://github.com/php/php-src/commit/36b88d77f2a9d0ac74692a679f636ccb5d11589f)
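
As an illustration of the hand-rolled approach, here is a sketch in Python
rather than PHP (urllib.parse standing in for parse_url; the scheme whitelist
and the checks are examples, not an exhaustive policy):

    from urllib.parse import urlsplit

    ALLOWED_SCHEMES = {"http", "https", "ftp"}  # example policy

    def looks_like_url(s):
        parts = urlsplit(s)
        if parts.scheme not in ALLOWED_SCHEMES:
            return False
        if not parts.netloc:
            return False   # require an authority ("//host") component
        try:
            parts.port     # raises ValueError on a malformed port
        except ValueError:
            return False
        return True

    print(looks_like_url("http://example.com/"))  # True
    print(looks_like_url("about:blank"))          # False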

~~~
6cxs2hd6
Exactly. Isn't this as bad an idea as trying to parse HTML with regular
expressions? [1]

[1]: [http://stackoverflow.com/questions/1732348/regex-match-open-...](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454)

~~~
zAy0LfpBZLC8mAC
No, it's not. URIs are regular, so using _regular_ expressions is perfectly
fine.

~~~
6cxs2hd6
Good point. You're right, it's not _as_ bad an idea.

(I still think it's not a great idea. Being regular isn't necessarily the same
as being parseable with a maintainable regex.)

------
Dylan16807
Why are [http://www.foo.bar./](http://www.foo.bar./) and [http://a.b--c.de/](http://a.b--c.de/) supposed to fail?

The @stephenhay regex is just about perfect despite being the shortest. The
subtleties of hyphen placement aren't very important, and this is a dumb place
to filter out private IP addresses when a domain could always resolve to one.
Checking whether an IP is valid should be a later step.

~~~
saraid216
The trailing dot makes it an invalid URL per RFC 1738, which defines hostname
as *[ domainlabel "." ] toplabel.

"b--c" seems to be a valid domainlabel, though, so I'm not sure why it's on
there.

~~~
LukeShu
> The trailing dot makes it an invalid URL per RFC 1738, which defines
> hostname as *[ domainlabel "." ] toplabel.

I disagree. But, this is a tricky one. The relevant specs are:

    
    
        Spec               |          | Validity | Definition
        URL      (RFC1738) | obsolete |  invalid | hostname = *[ domainlabel "." ] toplabel
        HTTP/1.0 (RFC1945) | current  |  invalid | host     = <A legal Internet host domain name
                           |          |          |             or IP address (in dotted-decimal form),
                           |          |          |             as defined by Section 2.1 of RFC 1123>
        HTTP/1.1 (RFC2068) | obsolete |  invalid | ; same as RFC1945
        HTTP/1.1 (RFC2616) | obsolete |    valid | hostname = *( domainlabel "." ) toplabel [ "." ]
        URI      (RFC3986) | current  |    valid | host     = IP-literal / IPv4address / reg-name
                           |          |          | reg-name = *( unreserved / pct-encoded / sub-delims )
        HTTP/1.1 (RFC7230) | current  |    valid | uri-host = <host, see [RFC3986], Section 3.2.2>
    

The only way that URL is invalid is if we are in a strict HTTP/1.0 context.

As a note about RFC1738 being obsolete: these days a URL is just a URI (1)
whose scheme specifies it as a URL scheme, and (2) is valid according to the
scheme specification.

As the given URL is a valid URI, and is valid according to the current http
URL scheme specification (RFC7230), that URL is valid.
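
For the skeptical, a quick check of the trailing-dot case against RFC 3986's
reg-name production (a minimal Python sketch covering just that one rule):

    import re

    # reg-name = *( unreserved / pct-encoded / sub-delims )  -- RFC 3986, 3.2.2
    reg_name = re.compile(r"^(?:[A-Za-z0-9\-._~]|%[0-9A-Fa-f]{2}|[!$&'()*+,;=])*$")

    print(bool(reg_name.match("www.foo.bar.")))  # True: trailing dot is allowed
    print(bool(reg_name.match("a.b--c.de")))     # True: "--" in a label is fine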

~~~
mathias
You forgot the most relevant spec, the URL Standard:
[http://url.spec.whatwg.org/](http://url.spec.whatwg.org/)

------
eli
At best this lets you conclude that a URL _could_ be valid. Is that really
useful? Is the goal here to catch typos? Because you'd still miss an awful lot
of typos.

If you really want your URL shortener to reject bad URLs, then you need to
actually test fetching each URL (and even then...)

As an aside, I'd instantly fail any library that validates against a list of
known TLDs. That was a bad idea when people were doing it a decade ago. It's
completely impractical now.

~~~
mathias
My exact use case was the following: the user clicks a bookmarklet that passes
the current URL in the browser as a query string parameter to a URL shortener
script. The validation is then performed before the URL is shortened.

In that scenario, and with the given requirements, I can’t think of a case
where the validation fails. There’s no need to worry about protocol-relative
URLs, etc.

(Keep in mind that this page is 4 years old — I very well may have missed
something.)

> If you really want your URL shortener to reject bad URLs, then you need to
> actually test fetching each URL (and even then...)

I disagree. [http://example.com/](http://example.com/) might experience
downtime at some point in time, but that doesn’t mean it’s suddenly an invalid
URL.

> As an aside, I'd instantly fail any library that validates against a list of
> known TLDs. That was a bad idea when people were doing it a decade ago. It's
> completely impractical now.

Agreed.

~~~
eli
I still don't quite follow the purpose of the validation. Is it against
malicious use? In normal use, I would think that pretty much any URL that's
good enough for the browser sending it would be good enough for the link
shortener.

~~~
mathias
But then you might end up shortening things like `about:blank` by accident.

------
zAy0LfpBZLC8mAC
WTF? When will people finally learn to read the spec, implement things based
on the spec, and test things based on the spec, instead of just making up for
themselves what a URL is or what HTML is or what an email address is or what a
MIME body is or ...

There are supposed URIs in that list that aren't actually URIs, there are
supposed non-URIs in that list that are actually URIs, and most of the
candidate regexes obviously must have come from some creative minds and not
from people who should be writing software. If you just make shit up instead
of referring to what the spec says, you urgently should find yourself a new
profession, this kind of crap has been hurting us long enough.

(Also, I do not just mean the numeric RFC1918 IPv4 URIs, which obviously are
valid URIs but have been rejected intentionally nonetheless - even though
that's idiotic as well, of course, given that (a) nothing prevents anyone from
putting those addresses in the DNS and (b) those are actually perfectly fine
URIs that people use, and I don't see why people should not want to shorten
some class of the URIs that they use.)

By the way, the grammar in the RFC is machine readable, and it's regular. So
you can just write a script that transforms that grammar into a regex that is
guaranteed to reflect exactly what the spec says.
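
A toy sketch of that idea in Python (illustrative only: real ABNF has more
syntax than these {name} placeholders, and the rules below are just the RFC
3986 host productions, listed bottom-up so one substitution pass suffices):

    import re

    rules = {
        "unreserved":  r"[A-Za-z0-9\-._~]",
        "pct-encoded": r"%[0-9A-Fa-f]{2}",
        "sub-delims":  r"[!$&'()*+,;=]",
        "reg-name":    r"(?:{unreserved}|{pct-encoded}|{sub-delims})*",
    }

    def expand(name):
        # Substitute each rule reference with its (already literal) body.
        pattern = rules[name]
        for ref, body in rules.items():
            pattern = pattern.replace("{" + ref + "}", "(?:" + body + ")")
        return pattern

    host_re = re.compile("^" + expand("reg-name") + "$")
    print(bool(host_re.match("www.foo.bar.")))  # True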

~~~
mathias
The goal was to come up with a good regular expression to validate URLs in
user input, and not to match any URL that browsers can handle (as per the URL
Standard). I am fully aware that this is not the same as what any spec says.

> By the way, the grammar in the RFC is machine readable, and it's regular.

The RFC does not reflect reality either (which, ironically, is what you seem
to be complaining about). If you’re looking for a spec-compliant solution, the
spec to follow is [http://url.spec.whatwg.org/](http://url.spec.whatwg.org/).

> If you just make shit up instead of referring to what the spec says, you
> urgently should find yourself a new profession, this kind of crap has been
> hurting us long enough.

I am aware of, and am a contributor to, the URL Standard:
[http://url.spec.whatwg.org/](http://url.spec.whatwg.org/) That doesn’t mean
there aren’t any situations in which I need/want to blacklist some technically
valid URL constructs.

~~~
zAy0LfpBZLC8mAC
> The goal was to come up with a good regular expression to validate URLs in
> user input, and not to match any URL that browsers can handle (as per the
> URL Standard).

WTF? What is "validation" supposed to be good for if it doesn't actually
validate what it claims to? Exactly this mentality of making up your own stuff
instead of implementing standards is what causes all these interoperability
nightmares! If you claim to accept URLs, then accept URLs, all URLs, and
reject non-URLs, all non-URLs. There is no reason to do anything else, other
than laziness maybe, and even then you are lying if you claim that you are
validating URLs - you are not. If you say you accept a URL, and I paste a URL,
your software is broken if it then rejects that URL as invalid.

This does not apply to intentionally selecting only a subset of URLs that are
applicable in a given context, of course - if the URL is to be retrieved by an
HTTP client, it's perfectly fine to reject non-HTTP URLs, of course, but any
kind of "nobody is going to use that anyhow" is not a good reason. In
particular, that kind of rejection most certainly is something that should not
happen in the parser as that is likely to give inconsistent results as the
parser usually works at the wrong level of abstraction.

> The RFC does not reflect reality either (which, ironically, is what you seem
> to be complaining about).

Well, or reality does not match the RFC?

> If you’re looking for a spec-compliant solution, the spec to follow is
> [http://url.spec.whatwg.org/](http://url.spec.whatwg.org/).

A spec for a formal language that doesn't contain a grammar? The world is
getting crazier every day ...

> That doesn’t mean there aren’t any situations in which I need/want to
> blacklist some technically valid URL constructs.

Yeah, but blocking IPv4 literals of certain address ranges seems like a stupid
idea nonetheless. Good software should accept any input that is meaningful to
it and that is not a security problem. And as I said above, such rejection
most certainly should not happen in the parser.

~~~
mathias
> > The RFC does not reflect reality either (which, ironically, is what you
> seem to be complaining about).

> Well, or reality does not match the RFC?

Doesn’t matter – if there’s a discrepancy between what a document says and
what implementors do, that document is but a work of fiction.

> And as I said above, such rejection most certainly should not happen in the
> parser.

This is not a parser.

~~~
zAy0LfpBZLC8mAC
> Doesn’t matter – if there’s a discrepancy between what a document says and
> what implementors do, that document is but a work of fiction.

Yes and no. When there is a de-facto standard that just doesn't happen to
match the published standard, yeah, sure. Otherwise, bug compatibility is a
terrible idea and should be avoided as much as possible, many security
problems have resulted from that.

> This is not a parser.

Well, even worse then. Manually integrating semantics from higher layers into
parsing machinery (which it is, never mind the fact that you don't capture any
of the syntactic elements within that parsing automaton) is both extremely
error prone and gives you terrible maintainability.

edit:

For the fun of it, I just had a look at the "winning entry" (diegoperini).
Unsurprisingly, it's broken. It was trivial to find cases that it will reject
that you most certainly don't intend to reject. For exactly the reasons
pointed out above.

------
bdarnell
Another important dimension when evaluating these regexes is performance. The
Gruber v2 regex has exponential (?) behavior on certain pathological inputs
(at least in the Python re module).

There are some examples of these pathological inputs at
[https://github.com/tornadoweb/tornado/blob/master/tornado/te...](https://github.com/tornadoweb/tornado/blob/master/tornado/test/escape_test.py#L20-29)
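
For anyone who wants to see the failure mode in isolation, here is a classic
catastrophic-backtracking toy in Python (not Gruber's actual pattern, just the
same pathology in miniature):

    import re, time

    evil = re.compile(r"^(a+)+$")   # nested quantifiers
    s = "a" * 25 + "!"              # almost matches; the "!" forces failure

    t0 = time.perf_counter()
    evil.match(s)   # the engine tries ~2^25 ways to split the run of a's
    print("%.2fs" % (time.perf_counter() - t0))
    # Each extra "a" roughly doubles the running time.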

~~~
keeperofdakeys
From experience, the python re module does weird things sometimes. There is a
better third-party regex module,
[https://pypi.python.org/pypi/regex](https://pypi.python.org/pypi/regex).
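
One concrete advantage of that module for this use case: recent versions
accept a timeout, so a pathological input raises instead of hanging (a sketch,
assuming the third-party regex package is installed):

    import regex

    evil = regex.compile(r"^(a+)+$")
    try:
        evil.match("a" * 40 + "!", timeout=0.5)
    except TimeoutError:
        print("regex gave up after 0.5s instead of hanging")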

~~~
pipeep
Does it use NFA?

[http://swtch.com/~rsc/regexp/regexp1.html](http://swtch.com/~rsc/regexp/regexp1.html)

Because the issue with the URL regex mentioned is with backtracking.

------
mdavidn
Use a standard URI parser to break this problem into smaller parts. Let a
modern URI library worry about arcane details like spaces, fragments,
userinfo, IPv6 hosts, etc.

    
    
    require 'uri'

    uri = URI.parse(target).normalize
    uri.absolute? or raise 'URI not absolute'
    %w[ http https ftp ].include?(uri.scheme) or raise 'Unsupported URI scheme'
    # Etc.

~~~
mathias
That was not an option in this case, as the goal is to validate URLs entered
as user input and blacklist certain URL constructs even though they’re
technically valid.

~~~
mdavidn
So check if uri.host matches your blacklist.
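
For instance, a Python sketch of the same idea (using the stdlib ipaddress
module to catch private/reserved IPv4 literals after parsing; the exact set of
checks is an example, not a complete policy):

    import ipaddress
    from urllib.parse import urlsplit

    def host_is_blacklisted(url):
        host = urlsplit(url).hostname or ""
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            return False   # not an IP literal; DNS names pass this check
        return (ip.is_private or ip.is_loopback
                or ip.is_multicast or ip.is_reserved)

    print(host_is_blacklisted("http://192.168.1.1/admin"))  # True
    print(host_is_blacklisted("http://example.com/"))       # False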

------
MatthewWilkes
Why no IPv6 addresses in the test cases?

------
VaucGiaps
Why not put in some of the new TLDs as test cases... ;)

------
eridius
John Gruber (of daringfireball.net) came up with a regex for extracting URLs
from text (Twitter-like) years ago, and has improved it since. The current
version is found at
[https://gist.github.com/gruber/249502](https://gist.github.com/gruber/249502).

I haven't tested it myself, but it's worth looking at.

Original post:
[http://daringfireball.net/2009/11/liberal_regex_for_matching...](http://daringfireball.net/2009/11/liberal_regex_for_matching_urls)

Updated version:
[http://daringfireball.net/2010/07/improved_regex_for_matchin...](http://daringfireball.net/2010/07/improved_regex_for_matching_urls)

Most recent announcement, which contained the Gist URL:
[http://daringfireball.net/linked/2014/02/08/improved-improve...](http://daringfireball.net/linked/2014/02/08/improved-improved-regex)

~~~
masklinn
Isn't that the @gruber v2 column on the page? It looks to have no false
negatives, but many false positives. The only one that does perfectly on the
tested set is Diego Perini's:
[https://gist.github.com/dperini/729294](https://gist.github.com/dperini/729294)

~~~
eridius
Hrm, you're right. I managed to miss seeing that while skimming the page.

------
Buge
Interestingly it seems [http://✪df.ws](http://✪df.ws) isn't actually valid,
even though it exists. ✪ isn't a letter[1], so it isn't allowed in
internationalized domain names. I was looking at the latest RFC, from 2010 [2], so
maybe it was allowed before that. The owner talks about all the compatibility
trouble he had after he registered it [3]. The registrar that he used for it,
Dynadot, won't let me register any name with that character, nor will
Namecheap.

[1]
[http://www.fileformat.info/info/unicode/char/272a/index.htm](http://www.fileformat.info/info/unicode/char/272a/index.htm)

[2] [http://tools.ietf.org/html/rfc5892](http://tools.ietf.org/html/rfc5892)

[3]
[http://daringfireball.net/2010/09/starstruck](http://daringfireball.net/2010/09/starstruck)

~~~
X-Istence
It's the same way with [http://💩.la/](http://💩.la/): using IDNA, it gets
turned into xn--ls8h.la.

That is a valid domain name and should be treated as such.
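
A quick check with Python's built-in idna codec bears this out. (Note that the
built-in codec implements the older IDNA 2003 rules; an RFC 5892 / IDNA 2008
validator such as the third-party idna package rejects U+1F4A9 as DISALLOWED.)

    # IDNA 2003 happily punycodes the emoji label:
    print("💩.la".encode("idna"))  # b'xn--ls8h.la'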

~~~
Buge
I guess you could argue about the definition of "valid". According to the RFC,
it's:

DISALLOWED: Those that should clearly not be included in IDNs. Code points
with this property value are not permitted in IDNs.

------
mnot
There is no perfect URL validation regex, because there are so many things you
can do with URLs, and so many contexts to use them in. So, it might be perfect
for the OP, but completely inappropriate for you.

That said, there is a regex in RFC3986, but that's for parsing a URI, not
validating it.

I converted 3986's ABNF to regex here:
[https://gist.github.com/mnot/138549](https://gist.github.com/mnot/138549)

However, some of the test cases in the original post (the list of URLs there
isn't available separately any more :( ) are IRIs, not URIs, so they fail;
they need to be converted to URIs first.

In the sense of the WHATWG's specs, what he's looking for _are_ URLs, so this
could be useful: [http://url.spec.whatwg.org](http://url.spec.whatwg.org)

However, I don't know of a regex that implements that, and there isn't any
ABNF to convert from there.

------
siliconc0w
This is a good lesson in why you want to avoid writing your own regexes. Even
something simple like an email address can be insane:
[http://ex-parrot.com/~pdw/Mail-RFC822-Address.html](http://ex-parrot.com/~pdw/Mail-RFC822-Address.html)

~~~
baudehlo
This always comes up in these discussions and is a terrible counterexample.
RFC 822 is the format for email messages (i.e. headers), not a form you'll
ever find email addresses in "in the wild" (e.g. on web forms).

------
lucb1e
What's wrong with IP-address URLs? If they are invalid because it says so in
some RFC, this is still not the ultimate regex. If you redirect a browser to
[http://192.168.1.1](http://192.168.1.1), it works perfectly fine.

And why must the root period after the domain be omitted from URLs? Not only
does it work in a browser (and people end sentences with periods), the domain
should actually end in a period all the time; it's usually omitted for ease of
use. Only some DNS applications still require domains to end with the root
dot.

~~~
mcpherrinm
This is the context of a URL shortener, and one of the goals seems to be to
prohibit bad IPs, not _all_ IPs. "Bad" seems to be defined here as addresses
in the private IP spaces, in multicast space, etc.

~~~
lucb1e
Okay, that makes sense!

------
tshadwell
I've put the test cases into a refiddle:
[http://refiddle.com/refiddles/53a736c175622d2770a70400](http://refiddle.com/refiddles/53a736c175622d2770a70400)

------
droope
I just validate with this regex '^http' :P

------
JetSpiegel
It has to match this valid URL: [http://موقع.وزارة-الاتصالات.مصر](http://موقع.وزارة-الاتصالات.مصر)

------
cobalt
What's wrong with /([\w-]+:\/\/[^\s]+)/gi

It's not fancy, but it will essentially match any URL.
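
A quick demonstration (in Python, for illustration) of just how permissive
that pattern is:

    import re

    loose = re.compile(r"([\w-]+://\S+)", re.IGNORECASE)
    for s in ("http://example.com/", "h-t-t-p://!!!", "foo://///"):
        print(s, "->", bool(loose.search(s)))
    # All three print True.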

~~~
zAy0LfpBZLC8mAC
/./ will also match any URL. The point is to _reject_ non-URLs.

~~~
mathias
Yes, to reject non-URLs and also some URLs that are technically valid but that
I want to explicitly disallow anyway.

------
Sir_Cmpwn
When you have a hammer, everything looks like a nail.

------
CMCDragonkai
What do the red vs green boxes mean?

~~~
CMCDragonkai
Oh I get it now. Got confused between 1s and 0s.

------
timmm
What flavor of regex are we making this in?

------
lazyloop
You do realize that RFC 3986 actually contains an official regular expression,
right?
[http://tools.ietf.org/html/rfc3986#appendix-B](http://tools.ietf.org/html/rfc3986#appendix-B)

~~~
saraid216
That doesn't work for validation.

~~~
lazyloop
Of course it does; if that regular expression matches, we have a valid URI.
It's silly what gets downvoted on HN these days.

~~~
zAy0LfpBZLC8mAC
No, it doesn't. This regular expression matches strings that are not URIs, and
that should be quite obvious if you compare the grammar in that same RFC to
the regex.

~~~
lazyloop
Yes, it does. And it is quite obvious if you compare the grammar to the
regular expression.

~~~
zAy0LfpBZLC8mAC
Please show how to produce the following string (which is matched by that
regex) from the grammar (without the quotes):

"%x"

(edit: actually, feel free to do it with the quotes included if you like, that
would still be matched by that regex)
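
For the curious, the Appendix B expression really does match that string; it
was written for parsing, so it matches nearly anything (a quick check in
Python, with the regex copied verbatim from the RFC):

    import re

    # Appendix B of RFC 3986:
    APPENDIX_B = re.compile(
        r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")

    m = APPENDIX_B.match("%x")
    print(m.group(5))  # '%x' -- it "parses" as a path, but "%x" is not a
                       # valid pct-encoded sequence, so the string is no URI.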

