
In search of the perfect URL validation regex - edward
https://mathiasbynens.be/demo/url-regex
======
billyhoffman
Please. Please Please Please. With Sugar on Top. Don't do this with a regex.

I've written a lot of web facing software that accepts URLs from the untrusted
masses and ultimately makes requests to them if they are "valid." The lesson
I've learned is simple: regexes are terrible for this task because there are a
_ton_ of things you need to check and lots of normalization you need to do.
Instead, do this as a function.

I've evolved mine over the years, and my use case is semi-specific: Given a
string, validate it as a fully qualified HTTP/HTTPS URL that doesn't have
credentials and isn't trying to point my software toward the internal network
or localhost. It looks like this:

\- Use system/framework library to create Uri object from source string. All
your checks will be consulting this object's properties, not looking at the
source string

\- Is scheme HTTP/HTTPS? If no, stop

\- Did they supply user:pass@ in URL? If so, stop, and yell at them for
putting usernames and passwords into a random site on the Internet.

\- If hostname is an IP address, normalize it to dotted decimal quad IPv4 or
IPv6 (no octal obfuscation for you!), and test against private IP space ranges
or loopback. If private or loopback, stop

\- If hostname is an actual hostname, normalize it + de-punycode it, and
check for localhost aliases. If local, stop (you can also do a DNS lookup and
make sure someone isn't trying to return private/local IPs to bypass your
checks)

At this point, you have a syntactically valid, fully qualified URL pointing to
a public facing web property accessed via HTTP or HTTPS.

You don't have to worry about TLDs or the like. At this point, you can do
additional DNS checks, check the domain against lists of bad actors, whatever
else you want to do. You can try to be smart and do things like, "if the
supplied URL wasn't fully qualified, prepend [http://](http://) and try
validation again" to avoid user error. Pretty flexible.

This is more rigorous than a simple regex and way way way easier for another
developer to read and understand what is going on.

~~~
elithrar
> \- Did they supply user:pass@ in URL? If so, stop, and yell at them for
> putting usernames and passwords into a random site on the Internet.

FWIW, browsers don't send the user:pass in the URL - they automatically
marshal it into the Authorization: Basic header. Obviously sending these over
HTTP is still dumb, but they're not (typically) logged in plain-text in
browser history/server request logs.

~~~
billyhoffman
Ahh. Very true. Let me be a little more clear. I work on public web apps where
there is literally an HTML text input for someone to submit a URL that we
should audit. I'm not talking about people typing URLs directly into a
browser.

At least once a month I get someone giving me a URL with embedded credentials
to a dev or staging environment for an Alexa top 50K site. Things like
[https://qa:tester@dev.major-ecomm-site.com/blah](https://qa:tester@dev.major-ecomm-site.com/blah).
It's pretty terrible, but it makes for an interesting sales call :-)

------
jetpks
Why are these in the `should fail' section?

    http://www.foo.bar./
    http://10.1.1.1
    http://10.1.1.254
    http://10.1.1.0
    http://10.1.1.255

The top one is just a fully qualified domain name, and the rest could be valid
host addresses depending on your subnet size.
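For what it's worth, everything under 10.0.0.0/8 is RFC 1918 private space regardless of your local subnetting, which is presumably why the demo's test suite rejects these (guarding against requests into internal networks). A quick stdlib check:

```python
import ipaddress

# 10.0.0.0/8 is RFC 1918 private space, whatever the local subnet size is.
for host in ("10.1.1.1", "10.1.1.254", "10.1.1.0", "10.1.1.255"):
    assert ipaddress.ip_address(host).is_private
```

Whether "private" should mean "invalid" is a policy choice, as other comments here note.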

~~~
gburt
> [http://3628126748](http://3628126748)

This shouldn't fail. It resolves to an IP owned by The Coca Cola Corp (not an
internal IP)

~~~
chimeracoder
I'm confused by how that works. I've never seen IP addresses expressed in
decimal form before (which is what I'm assuming that is).

~~~
mitchty
It is in octal, actually, and it isn't very portable.

It should also start with a 0 to be "correct", where "correct" means the libc
inet_aton() function will accept it and convert it to an IP address. The octal
notation is technically nonstandard for URIs and is more an artifact of the
libc the software you're using is sitting on.

~~~
quesera
That looks like decimal to me. I didn't convert it to test, but I see a couple
8s. :)

Regardless, there are valid dotless decimal, octal, and hex representations
for every IP address. Library support is variable.

------
colinbartlett
A huge pet peeve of mine is sites that reject the new TLDs in emails or URLs.
I use my hotfresh.pizza email address pretty often, and you'd be surprised how
many sites reject it with "Please enter a valid email address." Infuriating.

~~~
dmd
You'd be surprised how many sites reject my domain, 3e.org. Sometimes it's
because it begins with a number, sometimes it's because it's too short.

~~~
LukeShu
To be fair, an old RFC referenced by an RFC that is referenced by the current
thing specifies that a segment of a domain name can't start with a number.
Obviously, that isn't true in the real world.

~~~
ikeboy
So, TLDs are selling (technically) invalid domains?

~~~
LukeShu
And every major implementation supports them.

Well, I trudged through the RFCs in ~2010, so it's possible that it's been
updated since then.

------
criley2
Why should URL's like [http://3628126748](http://3628126748) fail?

For example, [http://1249711460](http://1249711460) resolves just fine in
Chrome.

~~~
eugenekolo2
What is this witchcraft?

~~~
facetube
It's a 32-bit IP address in decimal (sorry, base ten) form, IIRC.

~~~
eugenekolo2
I also figured it out: it's the base-256 IPv4 quad converted to a single
number.

74.125.21.100 => 100x1 + 21x256 + 125x256^2 + 74x256^3 => 1249711460
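That arithmetic can be checked in a couple of lines of Python (the helper names here are made up for illustration):

```python
import socket
import struct

def ip_to_int(dotted: str) -> int:
    # Big-endian ("network order") 32-bit integer form of a dotted quad.
    return struct.unpack("!I", socket.inet_aton(dotted))[0]

def int_to_ip(n: int) -> str:
    return socket.inet_ntoa(struct.pack("!I", n))

assert ip_to_int("74.125.21.100") == 1249711460
assert int_to_ip(1249711460) == "74.125.21.100"
```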

------
matt2000
Nice to have this comparison in one place, but I think it also serves as a
good illustration of the limits of regex usefulness. It feels like this would
be better implemented in the language of your choice than as an impenetrable
string of characters.

~~~
treve
This is partially because every single regex in the example is terribly
written. Regexes _can_ look good and understandable with the x modifier and
lots of spaces and comments.

You wouldn't write a normal function in one line with no comments, why do it
with regex?

~~~
ethbro
Obligatory "Conway's GoL in one line of APL"
[http://catpad.net/michael/apl/](http://catpad.net/michael/apl/)

------
k_sze
And now you have 404 problems.

In all seriousness, this is why I don't like the IETF's documents. They write
in a verbose way and then don't even provide a reference implementation, in
this case, a reference regex that would have done away with lots of dispute
and ambiguity. It is my opinion that, in practice, your specification is
inherently broken if you can't provide a reference implementation.

(No, BNF doesn't count as reference "implementation". Who uses BNF in their
programs to validate strings anyway?)

~~~
SFjulie1
There is more than grammar in URLs: there is politics, like gov.uk effectively
being a TLD.

And that is why IETF RFCs are so verbose: because there is politics in them.
That is why you can't provide regexps.

URLs, because of politics, are not context-free.
[https://www.cs.rochester.edu/~nelson/courses/csc_173/grammar...](https://www.cs.rochester.edu/~nelson/courses/csc_173/grammars/cfg.html)

So a URL, because of politics, has a context-sensitive grammar. And regexps
cannot parse anything beyond context-free stuff.

To be honest, I doubt anything useful is context-free. I doubt, therefore,
that anything useful can be parsed with regexps... except floats, integers,
and other basic types that are useful for building a context-sensitive
grammar. For that, RFCs should separate standards into the context-free parts
(the grammar rules) and a config file for the political/commercial parts
(context that changes the meaning of the atoms being parsed for illogical
reasons).

The problem is that politics is fucking hard to normalize; we have no BSML
yet.

~~~
pekk
A minor clarification for those who might be learning from HN: regular
expressions aren't equivalent to context-free grammars, they are even more
limited than context-free grammars. Things which involve arbitrary-depth
nesting or recursion, like HTML, can't be parsed by regular expressions, but
that is exactly what context-free grammars are for. Where regular expressions
correspond to finite state automata (essentially directed graphs where nodes
are states and edges are input-driven transitions between states), context-
free grammars correspond to pushdown automata (with a stack, this is the
recursion secret sauce). Lots of interesting things are context-free. For
example, any good programming language grammar should be context-free, to
allow sane and efficient parsing. I welcome further corrections.
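A toy illustration of that "secret sauce": arbitrary-depth balanced parentheses are the classic non-regular, context-free language. No single regular expression recognizes them at every depth, but a one-counter degenerate "stack" handles them trivially (a sketch, not tied to anything in the linked demo):

```python
def balanced(s: str) -> bool:
    # The counter plays the role of the pushdown automaton's stack:
    # push on "(", pop on ")", and never let it go negative.
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0
```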

~~~
pjtr
In other words, regular expressions are equivalent to _regular grammars_ [1].
(Except that "modern regular expressions" support some constructs that make
them match some non-regular languages. [2])

[1]
[https://en.wikipedia.org/wiki/Regular_grammar](https://en.wikipedia.org/wiki/Regular_grammar)

[2]
[https://en.wikipedia.org/wiki/Regular_expression#Patterns_fo...](https://en.wikipedia.org/wiki/Regular_expression#Patterns_for_non-regular_languages)

------
gtrubetskoy
What about an IPv6 URL, e.g.
[http://[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]:80/index.ht...](http://\[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210\]:80/index.html)

~~~
recentdarkness
Exactly what I was thinking when I checked the tests, and I haven't seen
anything in regard to that. If they really want the real deal, they would have
to support that too.

However, as others have already pointed out, it's better not to use a regex
for this but a proper library for your language, which would bail out as soon
as it hits something invalid. With the hope that the library for the language
already supports that.
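For what it's worth, standard URL libraries do handle the bracketed IPv6 host form from RFC 3986; Python's stdlib parser, for example:

```python
from urllib.parse import urlsplit

# urlsplit understands "[...]" IPv6 host literals with a port suffix.
u = urlsplit("http://[FEDC:BA98:7654:3210:FEDC:BA98:7654:3210]:80/index.html")
host = u.hostname  # brackets stripped, hex digits lowercased
port = u.port      # 80, parsed from after the closing bracket
```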

------
ipozgaj
A lot of these regexes miss the whole point of validation. The key problem to
solve here is determining whether a URL is _syntactically_ correct, yet a lot
of them focus on the semantics as well. Hardcoding the TLDs is not needed at
all: [http://microsoft.com](http://microsoft.com) and
[http://microsoft.foobar](http://microsoft.foobar) should both be valid (I
could have the latter in my /etc/hosts). Besides, that's what DNS is for.

------
noondip
[http://./](http://./) is a perfectly valid and very short URL, even though it
is highly unlikely anyone (InterNIC? ICANN? Verisign?) would issue an address
record for @. On a related note, has anyone seen ccTLDs do something like this
(e.g., [http://io./](http://io./), [https://co.uk./foo](https://co.uk./foo))?

~~~
nitrogen
The Vatican advertises MX records on _va._ :

    $ dig va MX
    ...
    ;; ANSWER SECTION:
    va.			3599	IN	MX	10 mx12.vatican.va.
    va.			3599	IN	MX	10 mx11.vatican.va.
    va.			3599	IN	MX	100 raphaelmx3.posta.va.

~~~
noondip
Cool! Anyone want to try emailing pope@va and seeing what comes of it?

------
bazzargh
The one marked as winning, by @diegoperini, has problems.

The way it defines passwords is wrong, so you can trick it into accepting
almost anything by putting the domain somewhere else:

    re_weburl.test("http://127.0.0.1/")
    => false
    re_weburl.test("http://127.0.0.1/@example.com")
    => true
    re_weburl.test("http://999.999.999.999.999/@example.com")
    => true

oops. I disagree that it should be rejecting rfc1918 addresses anyway, because
this makes it less useful in an intranet context, where you _want_ those to
work.

There's also an apples-to-oranges comparison going on here. The Gruber pattern
is not for validation, but for detecting url-like-things in text, which is why
it excludes a whole bunch of punctuation chars from appearing at the end -
when I say 'google.com.' in text, I mean 'google.com'.

Edited to add: I misremembered suggesting trailing punctuation exclusion to
Gruber, what we discussed was xxx.xxx/xxx as an alternate pattern, catching
protocol-less shortened urls in tweets.
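A minimal sketch of that detection-oriented behavior (a hypothetical, much cruder pattern than Gruber's): find url-ish runs in prose, then strip trailing punctuation that belongs to the sentence rather than the URL:

```python
import re

# Deliberately greedy: grab any run of non-whitespace after the scheme.
URLISH = re.compile(r"https?://\S+")

def find_urls(text: str) -> list[str]:
    # Trailing sentence punctuation almost never belongs to the URL itself,
    # so 'http://google.com.' in prose yields 'http://google.com'.
    return [m.group().rstrip(".,;:!?)'\"") for m in URLISH.finditer(text)]
```

A validator must reject such strings outright; a detector has to decide where they end. Same regex family, very different jobs.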

------
seba_dos1
What about new TLDs? URLs with them should also be checked. I can see that
some regexes contain a list of TLDs, which is disqualifying by itself.

------
Ov3rload
Why would you ever use a regex to do this, apart from experimenting?

This would be much easier to do with a context free grammar or any decent
parsing library.

------
hpaavola
In many cases you don't want
"[http://foo.bar?q=Spaces](http://foo.bar?q=Spaces) should be encoded" to
pass. If, for example, you want to turn URLs in comments into links, then a
space should just end the URL right there. Otherwise you end up turning whole
paragraphs into links.

------
matrix
While I admire the effort that people put into this (and the similar effort on
email address validation), what I really want to see is a comparison based on
"good enough" validation vs. performance. I think a reasonably low
false-positive rate is a fair trade-off for fast validation.

------
jakeogh
I'm using a modified version of this for IRI/URI validation:

[https://github.com/nisavid/spruce-iri/blob/master/spruce/iri...](https://github.com/nisavid/spruce-iri/blob/master/spruce/iri/goose.py)

------
aruggirello
I just don't understand why so many people like to reinvent the wheel all the
time. In PHP, there's the filter_var() family of functions with plenty of
filter types, like FILTER_VALIDATE_URL, or you could use parse_url() if you
need to add further constraints to validated URLs, like forbidding localhost,
etc. IMHO, complex regular expressions should be avoided as they make
debugging a PITA and are usually a performance bottleneck, too.

------
OliverJones
Terrific. Sr. Perini's regex has served me well. That being said, if you solve
a problem with a list of thirteen regexes, you then have fourteen problems.

------
javan
Here's my easy (but not totally passing) version using the DOM:
[https://gist.github.com/javan/6aaebfeb5fe415498028](https://gist.github.com/javan/6aaebfeb5fe415498028)

------
codezero
Maybe gwern can build a neural net for this too. :)

------
msane
Great matrix

------
josephscott
Eyeballing those charts it looks like @scottgonzales scores the best, followed
by @cowboy.

~~~
function_seven
Looks like @diegoperini has a perfect score. (The site is wide, so if your
screen is narrower you won't see his column.)

~~~
josephscott
You are correct, I missed that one.

