In search of the perfect URL validation regex (2010) (mathiasbynens.be)
152 points by Jonhoo on Sept 25, 2021 | 64 comments



Even if you build a URL validation regex that follows RFC 3986[1] and RFC 3987[2], you will still get user bug reports, because web browsers follow a different standard.

For example, <http://example.com./>, <http:///example.com/>, and <https://en.wikipedia.org/wiki/Space (punctuation)> are classified as invalid URLs in the blog post, but they are accepted by browsers.
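
You can see this quickly with the WHATWG URL parser that browsers (and Node) implement; a minimal sketch, assuming a JS console:

  for (const s of [
    'http://example.com./',
    'http:///example.com/',
    'https://en.wikipedia.org/wiki/Space (punctuation)',
  ]) {
    console.log(new URL(s).href); // none of these throw
  }
  // http://example.com./                                 (trailing dot kept)
  // http://example.com/                                  (extra slash dropped)
  // https://en.wikipedia.org/wiki/Space%20(punctuation)  (space percent-encoded)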

As the creator of cURL puts it, there is no URL standard[3].

[1]: https://www.ietf.org/rfc/rfc3986.txt

[2]: https://www.ietf.org/rfc/rfc3987.txt

[3]: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/



Tangentially, YouTube had a bug surface last year where adding that extra dot let you avoid all ads. Previous discussion[1].

[1] https://news.ycombinator.com/item?id=23479435


This "bug" can definitely also be considered a feature ;-)


Also nearly every paywalled media site


There might not have been a generally accepted standard then, but there is now: https://url.spec.whatwg.org/


There's also a question of what we're really trying to validate, IMHO. All of these regex patterns will tell you that a string looks like a URL, but they won't actually tell you whether there's any web server listening at that particular URL, whether that server has the resource at that location, whether the server is reachable from where you want to fetch it, etc.
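
Reachability is a separate, necessarily network-level check; a rough sketch (the HEAD-request approach and the name isReachable are my own):

  const isReachable = async (urlString) => {
    try {
      const res = await fetch(urlString, { method: 'HEAD' });
      return res.ok; // 2xx: a server answered and has something at that location
    } catch {
      return false;  // DNS failure, connection refused, CORS (in browsers), etc.
    }
  };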


> All of these regex patterns will tell you that a string looks like a URL,

Yeah, that's it. That's what they're trying to validate.


It seems like the answer is almost always yes.


Using https://regex.help/, I got this beauty, which passes all the examples that should pass. Obviously there's some room for improvement ;) But it works!

  ^(?:http(?:(?:://(?:(?:(?:code\.google\.com/events/#&product=browser|\-\.~_!\$&'\(\)\*\+,;=:%40:80%2f::::::@ex\.com|foo\.(?:bar/\?q=Test%20URL\-encoded%20stuff|com/(?:\(something\)\?after=parens|unicode_\(\)_in_parens|b_(?:\(wiki\)(?:_blah)?#cite\-1|b(?:_\(wiki\)_\(again\)|/))))|uid(?::password@ex\.com(?::8080)?/|@ex\.com(?::8080)?/)|www\.ex\.com/wpstyle/\?p=364|223\.255\.255\.254|उदाहरण\.परीक्षा|1(?:42\.42\.1\.1/|337\.net)|مثال\.إختبار|df\.ws/123|a\.b\-c\.de|\.ws/䨹|⌘\.ws/|例子\.测试|j\.mp)|142\.42\.1\.1:8080/)|\.damowmow\.com/)|s://(?:www\.ex\.com/foo/\?bar=baz&inga=42&quux|foo_bar\.ex\.com/))|://(?:uid(?::password@ex\.com(?::8080)?|@ex\.com(?::8080)?)|foo\.com/b_b(?:_\(wiki\))?|⌘\.ws))|ftp://foo\.bar/baz)$
I had to replace some words with shorter ones to squeeze under the 1000-character limit, and there's no way to provide negative examples right now. Something to fix!


> ⌘\.ws

I guess this is the regex equivalent of overfitting :)


Yeah, not to mention "code.google.com" being right in there!


Yeah, grex (the library powering this) is really cool, but doesn’t generalize very well. I’m sure there are ways to improve it, but it’s not a trivial thing to do.


Any sufficiently advanced technology is indistinguishable from magic.

This kind of feels like a magic spell :)


Two past discussions, for the curious:

In search of the perfect URL validation regex - https://news.ycombinator.com/item?id=10019795 - Aug 2015 (77 comments)

In search of the perfect URL validation regex - https://news.ycombinator.com/item?id=7928968 - June 2014 (81 comments)


> Assume that this regex will be used for a public URL shortener written in PHP, so URLs like http://localhost/, //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and tel:+1234567890 shouldn’t pass (even though they’re technically valid)

At Transcend, we need to allow site owners to regulate any arbitrary network traffic, so our data flow input UI¹ was designed to detect all valid hosts (including local hosts, IDN, IPv6 literal addresses, etc) and URLs (host-relative, protocol-relative, and absolute). If the site owner inputs content that is not a valid host or URL, then we treat their input as a regex.

I came up with these simple utilities built on top of the URL interface standard² to detect all valid hosts & URLs (a rough sketch of the approach follows the examples below):

• isValidHost: https://gist.github.com/eligrey/6549ad0a635fa07749238911b429...

Example valid inputs:

  host.example
  はじめよう.みんな (IDN domain; xn--p8j9a0d9c9a.xn--q9jyb4c)
  [::1] (IPv6 address)
  0xdeadbeef (IPv4 address; 222.173.190.239)
  123.456 (IPv4 address; 123.0.1.200)
  123456789 (IPv4 address; 7.91.205.21)
  localhost
• isValidURL (and isValidAbsoluteURL): https://gist.github.com/eligrey/443d51fab55864005ffb3873204b...

Example valid inputs to isValidURL:

  https://absolute-url.example
  //relative-protocol.example
  /relative-path-example
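
The gist links above are truncated, so here's a minimal sketch of the general approach (my own rough approximation built on the URL interface, not the linked code):

  // Treat input as a valid host if the URL parser accepts it as an authority
  // and nothing spills over into other URL components.
  const isValidHost = (input) => {
    try {
      const u = new URL(`http://${input}/`);
      return u.host !== '' && u.pathname === '/' &&
             !u.search && !u.hash && !u.username && !u.password;
    } catch {
      return false;
    }
  };

  // Relative URLs need a base to resolve against; the base itself is arbitrary.
  const isValidURL = (input) => {
    try {
      new URL(input, 'https://base.example');
      return true;
    } catch {
      return false;
    }
  };
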
1. https://docs.transcend.io/docs/configuring-data-flows

2. https://developer.mozilla.org/en-US/docs/Web/API/URL


While not terribly important, or even required, this fails (treats the input as a regex) for link-local addresses with a device identifier (zone ID) applied, like "[fe80::8caa:8cff:fe80:ff32%eth0]", although that would need to be fixed in the standard if it's desired :)

I've found some reasoning[0] as to why it's not supported, with browsers in mind, though.

[0] https://www.w3.org/Bugs/Public/show_bug.cgi?id=27234#c2
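
For what it's worth, you can reproduce the rejection directly (assuming a WHATWG-conformant parser; both forms throw for me):

  new URL('http://[fe80::8caa:8cff:fe80:ff32%eth0]/');   // throws TypeError
  new URL('http://[fe80::8caa:8cff:fe80:ff32%25eth0]/'); // RFC 6874 form also throws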


> I also don’t want to allow every possible technically valid URL — quite the opposite.

Well, that should make things a lot easier. What does he mean here? The rest of the text doesn't make it clear to me, unless it's meant to be "every possibly valid HTTP, HTTPS, or FTP URL" which isn't exactly "the opposite".


The next paragraph might be that clarification, although I agree it isn't totally clear what he meant there:

> Assume that this regex will be used for a public URL shortener written in PHP, so URLs like http://localhost/, //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and tel:+1234567890 shouldn’t pass (even though they’re technically valid). Also, in this case I only want to allow the HTTP, HTTPS and FTP protocols.


Honest question: there is a famous and very funny Stack Exchange answer on the topic of parsing HTML with a regex [1] that states that the problem is in general impossible, and that if you find yourself doing this, something has gone wrong and you should re-evaluate your life choices / pray to Cthulhu.

So, does this apply to URLs? The fact that these regexes are... so huge... makes me think that something is fundamentally wrong. Are URLs describable in a Chomsky Type 3 grammar? Are they sufficiently regular that using a regex is sensible? What do actual browsers do?

[1] https://stackoverflow.com/questions/1732348/regex-match-open...


Caveats: I know nothing of Chomsky grammars, and I have only a passing familiarity with Cthulhu, but IMO the real crux of the issue with parsing HTML with regex (beyond all the “it’s hard”, “the spec is more complicated than you think”, “regex is impossible to read”, etc.) is that HTML is a recursive data structure, e.g. you can have a div, inside a div, inside a div, ad infinitum. Regex, AFAIK, doesn’t allow you to describe recursion, so you’re left with regex plus supporting code. You’ll then have an impedance mismatch between the two.

URLs are not recursive structures, so I’d say the single hardest feature of HTML is not present.
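
A tiny illustration of that mismatch, using a hypothetical naive pattern:

  // Regex can't count nesting depth: a lazy pattern pairs the outer <div>
  // with the *first* closing tag it finds.
  const m = '<div><div>inner</div></div>'.match(/<div>(.*?)<\/div>/);
  console.log(m[1]); // '<div>inner' -- the inner </div> ended the match too early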


The times I had to use it on HTML, I think I combined XPath with regex to close the mismatch.


I haven't looked at the BNF(s) for URIs lately, but I don't recall there being any recursion, so I wouldn't be surprised if the language were regular.

There was a Perl program that would take something like a BNF and barf out a gigantic regex (maybe with some maximum depth).


> So, does this apply to URLs? The fact that these regexes are... so huge... makes me think that something is fundamentally wrong

Yes. If your regex is longer than {.../50/100/...} characters, write a parser.

I struggle to understand why people write those crazy regexes for emails, URLs, and HTML when probably all popular technologies have battle-tested parsers for those things.


On top of this, the error messages from a regex will be very one-dimensional.

As an example, http://localhost/ is a technically valid URL, which he wants to block. Should this error say "malformed URL" like all the others?

A regex that tries to cover all such cases is really the wrong tool for the job.
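
For illustration, a sketch of what the parser-first version might look like (the helper name and the policy rules are made up for the shortener example):

  const checkShortenerInput = (input) => {
    let u;
    try {
      u = new URL(input);
    } catch {
      return 'Not a parseable URL';
    }
    // Policy checks come after parsing, each with its own message.
    if (!['http:', 'https:', 'ftp:'].includes(u.protocol)) {
      return `Scheme "${u.protocol}" is not allowed`;
    }
    if (u.hostname === 'localhost') {
      return 'localhost is a valid URL, but not allowed here';
    }
    return null; // passes
  };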


Sometimes you're given an arbitrary bag of bytes with best-effort well-formed data. Regexes are gross but quite good for those cases where you need to try to rip out some bits from the data abyss.


I was just struggling with this -- specifically, our users' "UX" expectation that entering "example.com" should work when asked for their website URL.

Most URL validation rules/regexes/libraries/etc. reject "example.com". However, if you head over to Stripe (for example), in the account settings, when asked for your company's URL, Stripe will accept "example.com" and assume "http://" as the prefix (which, yes, can have its own problems).

What's a good solution? I want to both validate URLs and let users enter "example.com". But if I simply do

    if(validateURL(url)) {
      return true;
    } else if(validateURL("http://" + url)) {
      return true;
    } else {
      return false;
    }
i.e. validate the given URL and, as a fallback, try to validate "http://" + the given URL, that opens the door to weird non-URL strings being incorrectly validated...

Help :-)


Parse, don’t validate. If you need a heuristic that accepts non-URL strings as if they were valid URLs, you should convert those non-URL strings to valid URLs so the rest of your code can just deal with valid URLs.

    if (validateURL(url)) {
      return url;
    } else if (validateURL("http://" + url)) {
      return "http://" + url;
    } else {
      return null;
    }


I know we're not golfing, but it pains me to see that repetition in the middle. Mightn't we write

    if (!validateURL(url)) {
        url = "http://" + url;
        if (!validateURL(url)) {
            url = null;
        }
    }
    return url;
to snip a small probability of a bug?


I find that branchiness (and mutation of the variable) harder to follow. Personally, I’d just take “parse, don’t validate” to its logical conclusion and go for:

    const parseUrl = url => validateUrl(url) ? url : null;
    return parseUrl(url) || parseUrl('http://'+url) || null;


Address validators for online checkout are notoriously inaccurate, though they still help a lot. You just have to prompt the user, "Did you mean 123 Example St?"

I'd probably do the same for poorly formatted URLs. When the user hits Submit, a prompt appears saying, "Did you mean `https://example.com`?"


I would suggest biasing your implementation against false negatives. They can always come back and update it if it's wrong, and their URL could just as easily be "valid" but incorrect, e.g. any typo in a domain name.

If it's really important, you could try making a request to the URL and see if it loads, but that still doesn't validate that it's the URL they intended to input.

It might be cool to load the URL with Puppeteer and capture a screenshot of the page. If they can't recognize their own website, it's on them.


This could potentially be abused, but you could actually try to resolve the DNS to determine whether it's valid (could be weird for some cases like localhost or IP addresses). Or just do a "curl https://whatever.com" and see what happens (assuming all of the websites are running a web server, although I don't know if that is true in your situation).
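
A rough sketch of the DNS idea in Node (hostResolves is a made-up name; IP literals and localhost would need the special-casing mentioned above):

  const dns = require('node:dns/promises');

  const hostResolves = async (urlString) => {
    try {
      await dns.lookup(new URL(urlString).hostname);
      return true;
    } catch {
      return false; // unparseable URL or a hostname that doesn't resolve
    }
  };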


@stephenhay seems to be the winner here if you don't need IP addresses (or weird dashed URLs). It's only 38 characters long and easy to understand.

    @^(https?|ftp)://[^\s/$.?#].[^\s]*$@iS
The simpler the better, if you're going to use something that is not ideal.
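
The @...@iS wrapping is PCRE (PHP) delimiter-and-flag syntax; in JavaScript the same pattern would be, if I have the escaping right:

  const re = /^(https?|ftp):\/\/[^\s/$.?#].[^\s]*$/i;
  re.test('https://example.com/path'); // true
  re.test('http://');                  // false: no host character after //
  re.test('https://exa mple.com');     // false: whitespace is rejected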


Doesn’t cover mailto:, which is fairly common. To be pedantic/strict, mailto: links are URIs, not URLs.


Tangentially related, but mentioning it to hopefully save someone time: if you ever find yourself wanting to check whether a version string is semver or not, before inventing your own regex, note that there is an official one provided.

I just discovered this yesterday and I’m glad I didn’t have to come up with this:

https://semver.org/#is-there-a-suggested-regular-expression-...

My use case for it: https://github.com/typesense/typesense-website/blob/25562d02...
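
For reference, the suggested pattern (quoted from memory, so verify against the link) drops straight into a JavaScript RegExp:

  const SEMVER_RE = /^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$/;
  SEMVER_RE.test('1.2.3-alpha.1+build.5'); // true
  SEMVER_RE.test('1.2');                   // false: all three components required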


The rules here don't make sense to me: http://223.255.255.254 must be allowed and http://10.1.1.1 must not. Is this to provide security for the 10.0.0.0/8 range? It doesn't do that, because foo.com could resolve to 10.1.1.1.


I was once failed on a technical interview, partly because on the coding test I was asked to write a URL parser "from scratch, the way a browser would do it", and I explained it would take way too long to account for every edge case in the URL RFC, but that I could do a quick and dirty approach for common URLs.

After I did this, the interviewer stopped me and told me, in a negative way, that he had expected me to use a regex, which kinda shows he had no idea how a web browser works.


How would you even "parse" a URL with a regex? Dynamically defined named subpatterns for each URL parameter? I think the best I could do on paper with a regex is say "yup, this is a URL" or maybe "yup, I can count the number of params".

Unless it was a specific URL with specific params?


Match groups so you can split it up into scheme, username, password, host, port, path, query, fragment. Not difficult to approximate, though for best results with diverse schemes you’d want an engine that allows repeated named groups, and I don’t know if any do (JavaScript and Python don’t).
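
Something along these lines, as a rough approximation (nowhere near spec-complete, and the group names are my own):

  const URL_RE = /^(?<scheme>[a-z][a-z0-9+.-]*):\/\/(?:(?<user>[^:@/]*)(?::(?<pass>[^@/]*))?@)?(?<host>[^:/?#]+)(?::(?<port>\d+))?(?<path>\/[^?#]*)?(?:\?(?<query>[^#]*))?(?:#(?<frag>.*))?$/i;
  const m = 'https://u:p@ex.com:8080/a/b?x=1#top'.match(URL_RE);
  console.log(m.groups.host, m.groups.port, m.groups.query); // ex.com 8080 x=1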


Python's `regex` package does allow repeated named groups.


I mean ya that would match a query string, but it wouldn't parse it?


I assume they meant "some regex implementation, including replace and/or match groups".

Like, for just the params part (yes, broken and simplistic):

  #!/usr/bin/perl
  $_="a=b&c=d&e=f&whatever=some thing";
  while (s/^([^&]*)=([^&]*)(&|$)//) {
    print "[$1] [$2]\n";
  }


Ya. I've also suffered copypasta trials administered by bar raisers, mensa members, and other self appointed keepers of the sacred nerd flame.

My imagined remedies are no 1:1 interviews and recording these sessions for "possible quality assurance and training purposes".


How do browsers parse URLs then?


There's actually a standard for it these days.

https://url.spec.whatwg.org/#url-parsing


Here’s a polyfill for the JS URL() interface which should give you a taste: https://github.com/zloirock/core-js/blob/272ac1b4515c5cfbf34... (I tried finding the one in Firefox but I couldn’t actually work out where it started, this one is much easier to follow)

TLDR: it’s a traditional parser—a big state machine that steps through the URL character by character and tokenizes it into the relevant pieces.
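
To give a flavor of it, here's a drastically simplified toy version (real parsers have dozens of states and handle percent-encoding, IPv6, relative references, etc.):

  const toyParse = (input) => {
    let state = 'scheme';
    const url = { scheme: '', host: '', path: '' };
    for (const c of input) {
      switch (state) {
        case 'scheme':  // consume until ':' ends the scheme
          if (c === ':') state = 'slashes';
          else url.scheme += c;
          break;
        case 'slashes': // skip the '//' before the authority
          if (c !== '/') { state = 'host'; url.host += c; }
          break;
        case 'host':    // consume until '/' starts the path
          if (c === '/') { state = 'path'; url.path = '/'; }
          else url.host += c;
          break;
        case 'path':
          url.path += c;
          break;
      }
    }
    return url;
  };
  // toyParse('https://ex.com/a/b') -> { scheme: 'https', host: 'ex.com', path: '/a/b' }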


Did you point out that his two requirements were contradictory?


It's not very likely that this is what's happening here, but I feel like this could be done on purpose to see how you act in this kind of situation. It kinda tells how you would act once you inevitably get into a conflict with colleagues arguing over stuff like that.


Or it could be one of those outsourced interviews.


In that case I think the proper response should be: “I am very sure that browsers don’t do it that way. But let’s have a look.” And then pull up the source code for Chromium and Firefox. Assuming it’s not whiteboard only.

And if they still insist even after the source of Chromium and FF has been consulted? Well, then it's time to leave. You don't want to work with anyone like that.





I use this:

  u.checkURL = function (string) {
    if ($.type(string) === "string") {
      if (/^(https?|ftp):(\/\/(((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:)*@)?((\[(|(v[\da-f]{1,}\.(([a-z]|\d|-|\.|_|~)|[!\$&'\(\)\*\+,;=]|:)+))\])|((\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5]))|(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=])*)(:\d*)?)(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*|(\/((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)?)|((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)+(\/(([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)*)*)|((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)){0})(\?((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|[\uE000-\uF8FF]|\/|\?)*)?(\#((([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(%[\da-f]{2})|[!\$&'\(\)\*\+,;=]|:|@)|\/|\?)\*)?$/i.test(string)) {
        return true;
      } else {
        return false;
      }
    } else {
      return false;
    }
  };


    if(predicate) {
        return true;
    } else {
        return false;
    }
can just be written:

    return predicate;
So, your code above can be:

    return $.type(string) === "string" && /regex/i.test(string);
Regexes like this are truly hideous, though; they may as well be written in Brainfuck for all their lack of maintainability and readability.

I will never understand why regular expressions are considered the best tool for the job when it comes to parsing; they are far too terse, and do not declare intent in any way.

Software development is not just about communicating with the computer; it’s about communicating with other engineers so we can work collaboratively. Regular expressions are the antithesis of that way of thinking.


If you're writing browser JS, just use the URL builtin

  const isValidUrl = urlString => {
    try {
      new URL(urlString);
      return true;
    } catch {
      return false;
    }
  };
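
One caveat: the constructor accepts any syntactically valid URL, including schemes you may not want, so a protocol allowlist is usually still needed. For example:

  isValidUrl('javascript:alert(1)'); // true: valid URL, probably unwanted
  isValidUrl('a:b');                 // true: any scheme-prefixed string parses
  isValidUrl('example.com');         // false: no scheme and no base URL given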


Here's the least imperfect Regex with Unit Tests on Regex101: https://regex101.com/r/IqI7KW/2


Can someone convert diegoperini's regex into a form compatible with Emacs Lisp? I freely admit to this being beyond my brainpower.


No validation of hex-encoded IPv4? Or did I just miss it on my quick scroll through?


Uh oh, Regex is approaching sentience.


Every known sentient being is a finite state machine. Every finite state machine corresponds to a regular expression, and vice versa.


> Every known sentient being is a finite state machine.

I know this is just a cutesy slogan, but how could you possibly know whether a living creature is a finite state machine? What would it even mean? I know I don't respond identically to identical stimuli presented on different occasions ….


> I know this is just a cutesy slogan

Mostly, yes, but I do think there's a real point here as well.

> how could you possibly know whether a living creature is a finite state machine?

As I understand it, physicists don't really know whether the physical world has a finite number of states, or an infinite number. I think they tend to lean toward finite, though.

Even if it's infinite, I doubt it's of consequence. That is to say, I doubt that sentience depends on the physical possibility of an infinite number of states. (Of course, if it turns out the physical world only has a finite number of states, that demonstrates that sentience is compatible with the finite-states constraint.)

> What would it even mean?

Systems can be modelled as finite state machines. Sentient entities like people are extremely sophisticated systems, but that's just a matter of degree, not of category.

> I know I don't respond identically to identical stimuli presented on different occasions

Right, because you're in a different state. You'll never be in the same state twice. We don't need to resort to non-determinism.


Obnoxious, I mean, trivial, answer: Just make "occasions" a variable. Assuming your lifetime is finite, you could simply assign each point in time to a value, and there you have it: a finite mapping from each moment to a state.



