Even if you built a URL validation regex that follows RFC 3986[1] and RFC 3987[2], you will still get user bug reports, because web browsers follow a different standard.
There's also a question of what we're really trying to validate, IMHO. All of these regex patterns will tell you that a string looks like a URL, but they won't actually tell you if: There's any web server listening at that particular URL; Whether that server has the resource in that location; If that server is reachable from where you want to fetch it; etc.
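Roughly, the kind of check a regex can never replace is a best-effort probe like this sketch (it assumes a runtime with a global fetch, such as a browser or Node 18+; the function name is made up):

    // Best-effort: does anything actually answer at this URL?
    // Some servers reject HEAD (405), and in browsers CORS can make perfectly
    // reachable URLs look like failures, so treat the result as a hint only.
    async function urlResponds(url: string): Promise<boolean> {
      try {
        const res = await fetch(url, { method: "HEAD", redirect: "follow" });
        return res.ok; // 2xx after redirects; says nothing about *what* is served
      } catch {
        return false; // DNS failure, refused connection, timeout, CORS, ...
      }
    }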
I had to replace some words with shorter ones to squeeze under 1000 char limit and there's no way to provide negative examples right now. Something to fix!
Yeah, grex (the library powering this) is really cool, but doesn’t generalize very well. I’m sure there are ways to improve it, but it’s not a trivial thing to do.
> Assume that this regex will be used for a public URL shortener written in PHP, so URLs like http://localhost/, //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and tel:+1234567890 shouldn’t pass (even though they’re technically valid)
At Transcend, we need to allow site owners to regulate any arbitrary network traffic, so our data flow input UI¹ was designed to detect all valid hosts (including local hosts, IDN, IPv6 literal addresses, etc) and URLs (host-relative, protocol-relative, and absolute). If the site owner inputs content that is not a valid host or URL, then we treat their input as a regex.
I came up with these simple utilities built on top of the URL interface standard² to detect all valid hosts & URLs:
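A minimal sketch of the idea, built directly on that parser (the names and exact checks here are illustrative, not the production utilities):

    // Does the WHATWG URL parser accept the input as a URL (absolute, or
    // relative if a base is supplied)?
    export function isValidUrl(input: string, base?: string): boolean {
      try {
        new URL(input, base); // throws TypeError on anything the parser rejects
        return true;
      } catch {
        return false;
      }
    }

    // Is the input a bare host (optionally with a port)? Wrap it in a throwaway
    // scheme so the parser applies its host rules (IDN, IPv6 literals, etc.),
    // then reject anything that parsed into more than host[:port].
    export function isValidHost(input: string): boolean {
      try {
        const url = new URL(`https://${input}`);
        return (
          url.username === "" &&
          url.password === "" &&
          url.pathname === "/" &&
          url.search === "" &&
          url.hash === ""
        );
      } catch {
        return false;
      }
    }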
While not terribly important, or outright not required, this fails (treats the URL as a regex) for link-local addresses with a device identifier (zone ID) applied, like "[fe80::8caa:8cff:fe80:ff32%eth0]", although that would need to be fixed in the standard if it's desired :)
I've found some reasoning[0] as to why it's not supported, with browsers in mind, though.
> I also don’t want to allow every possible technically valid URL — quite the opposite.
Well, that should make things a lot easier. What does he mean here? The rest of the text doesn't make it clear to me, unless it's meant to be "every possibly valid HTTP, HTTPS, or FTP URL" which isn't exactly "the opposite".
The next paragraph might be that clarification, although I agree it isn't totally clear what he meant there:
> Assume that this regex will be used for a public URL shortener written in PHP, so URLs like http://localhost/, //foo.bar/, ://foo.bar/, data:text/plain;charset=utf-8,OHAI and tel:+1234567890 shouldn’t pass (even though they’re technically valid). Also, in this case I only want to allow the HTTP, HTTPS and FTP protocols.
Honest question: there is a famous and very funny Stack Exchange answer on the topic of parsing HTML with a regex [1] that states that the problem is in general impossible, and that if you find yourself doing this, something has gone wrong and you should re-evaluate your life choices / pray to Cthulhu.
So, does this apply to URLs? The fact that these regexes are....so huge...makes me think that something is fundamentally wrong. Are URLs describable in a Chomsky Type 3 grammar? Are they sufficiently regular that using a Regex is sensible? What do the actual browsers do?
Caveats: I know nothing of Chomsky grammars, and I have only a passing familiarity with Cthulhu, but IMO the real crux of the issue with parsing HTML with regex (beyond all the “it’s hard”, “the spec is more complicated than you think”, “regex is impossible to read”, etc.) is that HTML is a recursive data structure, e.g. you can have a div, inside a div, inside a div, ad infinitum. Regex, AFAIK, doesn’t allow you to describe recursion, so you’re left with regex plus supporting code. You’ll then have an impedance mismatch between the two.
URLs are not recursive structures, so I’d say the single hardest feature of HTML is not present.
>So, does this apply to URLs? The fact that these regexes are....so huge...makes me think that something is fundamentally wrong
Yes. If your regex grows beyond some threshold (50 or 100 characters, say), write a parser instead.
I struggle to understand why people write those crazy regexes for emails, URLs, and HTML when practically every popular technology has battle-tested parsers for those things.
Sometimes you're given an arbitrary bag of bytes with best-effort well-formed data. Regexes are gross but quite good for those cases where you need to try to rip out some bits from the data abyss.
I was just struggling with this -- specifically, our users' "UX" expectation that entering "example.com" should work when asked for their website URL.
Most URL validation rules/regexes/libraries/etc. reject "example.com". However, if you head over to Stripe (for example), in the account settings, when asked for your company's URL, Stripe will accept "example.com" and assume "http://" as the prefix (which, yes, can have its own problems).
What's a good solution? I want to validate URLs, but also let users enter "example.com". But if I simply validate the given URL and, as a fallback, try to validate "http://" + the given URL, that opens the door to weird non-URL strings being incorrectly validated...
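Something like this sketch of the naive fallback (assuming the WHATWG URL API; the function name is made up):

    // Naive fallback: if the input doesn't start with a scheme, bolt "http://"
    // on and try again -- roughly the Stripe behaviour described above.
    function validateWebsiteUrl(input: string): URL | null {
      let candidate = input.trim();
      if (!/^https?:\/\//i.test(candidate)) {
        candidate = "http://" + candidate; // mutate and retry with an assumed scheme
      }
      try {
        return new URL(candidate);
      } catch {
        return null;
      }
    }
    // The catch: junk like "hello" now "validates" too (it parses as http://hello/).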
Parse, don’t validate. If you need a heuristic that accepts non-URL strings as if they were valid URLs, you should convert those non-URL strings to valid URLs so the rest of your code can just deal with valid URLs.
I find that branchiness (and mutation of the variable) harder to follow. Personally, I’d just take “parse, don’t validate” to its logical conclusion and go for:
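Something in this shape (again a sketch on top of the WHATWG URL API; the helper names are illustrative):

    // Parse, don't validate: always end up with a URL object (or null), with no
    // reassignment along the way.
    const tryParse = (s: string): URL | null => {
      try {
        return new URL(s);
      } catch {
        return null;
      }
    };

    const parseWebsiteUrl = (input: string): URL | null => {
      const url = tryParse(input.trim()) ?? tryParse(`http://${input.trim()}`);
      return url && (url.protocol === "http:" || url.protocol === "https:")
        ? url
        : null;
    };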
Address validators for online checkout are notoriously inaccurate, though they still help a lot. You just have to prompt the user, "Did you mean 123 Example St?"
I'd probably do the same for poorly formatted URLs. When the user hits Submit, a prompt appears saying, "Did you mean `https://example.com`?"
I would suggest biasing your implementation against false negatives. They can always come back and update it if it's wrong, and their URL could just as easily be "valid" but incorrect, e.g. any typo in a domain name.
If it's really important, you could try making a request to the URL and seeing if it loads, but that still doesn't validate that it's the URL they intended to input.
It might be cool to load the URL with Puppeteer and capture a screenshot of the page. If they can't recognize their own website, it's on them.
This could potentially be abused, but you could actually try to resolve the DNS to determine if it's valid (could be weird for some cases like localhost or IP addresses). Or just do a "curl https://whatever.com" and see what happens (assuming that all of the websites are running a webserver, although idk if that is true in your situation)
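As a sketch of the DNS idea (assuming Node's promise-based dns module; the function name is made up):

    import { resolve4 } from "node:dns/promises";

    // Best-effort hint: does the hostname resolve to at least one A record?
    // Says nothing about localhost, literal IPs, or whether a web server answers.
    async function hostnameResolves(hostname: string): Promise<boolean> {
      try {
        const addresses = await resolve4(hostname);
        return addresses.length > 0;
      } catch {
        return false;
      }
    }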
Tangentially related, but mentioning to hopefully save someone time: if you ever find yourself wanting to check if a version string is semver or not, before inventing your own, there is an official regex that’s provided.
I just discovered this yesterday and I’m glad I didn’t have to come up with this:
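For reference, the numbered-group pattern published at semver.org looks like this (transcribed here, so double-check it against the site before relying on it):

    // The official semver pattern from https://semver.org (numbered-group variant).
    const SEMVER =
      /^(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$/;

    SEMVER.test("1.2.3");             // true
    SEMVER.test("1.2.3-rc.1+build5"); // true
    SEMVER.test("1.02.3");            // false (leading zero in a numeric part)
    SEMVER.test("v1.2.3");            // false (the "v" prefix is not part of semver)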
The rules here don't make sense to me. http://223.255.255.254 must be allowed and http://10.1.1.1 must not. This is to provide security for the 10.0.0.0/8 range? This doesn't do that, because foo.com could resolve to 10.1.1.1.
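To make that point concrete: keeping requests out of 10.0.0.0/8 means checking the address the name actually resolves to, not the string in the URL. A sketch (assuming Node; the names are illustrative, and even this still races against DNS rebinding):

    import { lookup } from "node:dns/promises";
    import { isIP } from "node:net";

    // A regex can reject the literal "http://10.1.1.1", but foo.com may still
    // resolve straight into 10.0.0.0/8.
    function isInTenSlashEight(address: string): boolean {
      return isIP(address) === 4 && address.startsWith("10.");
    }

    async function resolvesIntoTenSlashEight(hostname: string): Promise<boolean> {
      const { address } = await lookup(hostname); // e.g. foo.com -> 10.1.1.1
      return isInTenSlashEight(address);
    }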
I was once failed on a technical interview, partly because on the coding test I was asked to write a url parser "from scratch, the way a browser would do it" and I explained it would take way too long to account for every edge case in the URL RFC but that I could do a quick and dirty approach for common urls.
After I did this, the interviewer stopped me and told me in a negative way that he expected me to use a regex, which kinda shows he had no idea how a web browser works.
How would you even "parse" a URL with a regex? Dynamically defined named subpatterns for each URL parameter? I think the best I could do on paper with a regex is say "yup, this is a URL" or maybe "yup, I can count the number of params".
Unless it was a specific URL with specific params?
Match groups so you can split it up into scheme, username, password, host, port, path, query, fragment. Not difficult to approximate, though for best results with diverse schemes you’d want an engine that allows repeated named groups, and I don’t know if any do (JavaScript and Python don’t).
Here’s a polyfill for the JS URL() interface which should give you a taste: https://github.com/zloirock/core-js/blob/272ac1b4515c5cfbf34... (I tried finding the one in Firefox but I couldn’t actually work out where it started, this one is much easier to follow)
TLDR: it’s a traditional parser—a big state machine that steps through the URL character by character and tokenizes it into the relevant pieces.
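For a flavor of the match-group approach, the component-splitting regex from RFC 3986, Appendix B can be written with named groups (it carves up anything vaguely URL-shaped and does no validation; splitting the authority into user/host/port would be a further step):

    // RFC 3986, Appendix B, with named groups added.
    const URL_PARTS =
      /^(?:(?<scheme>[^:/?#]+):)?(?:\/\/(?<authority>[^/?#]*))?(?<path>[^?#]*)(?:\?(?<query>[^#]*))?(?:#(?<fragment>.*))?$/;

    const m = "https://user:pass@example.com:8080/a/b?x=1#top".match(URL_PARTS);
    console.log(m?.groups);
    // { scheme: "https", authority: "user:pass@example.com:8080",
    //   path: "/a/b", query: "x=1", fragment: "top" }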
It's not very likely that this is what's happening here, but I feel like this could be done on purpose to see how you react in this kind of situation. It kind of shows how you would act once you inevitably get into a conflict with colleagues arguing over stuff like that.
In that case I think the proper response should be: “I am very sure that browsers don’t do it that way. But let’s have a look.” And then pull up the source code for Chromium and Firefox. Assuming it’s not whiteboard only.
And if they still insist even after the source of Chromium and FF has been consulted? Well, then it's time to leave. You don't want to work with anyone like that.
Regexes like this are truly hideous, though; they may as well be written in Brainfuck, for all the maintainability and readability they offer.
I will never understand why regular expressions are considered the best tool for the job when it comes to parsing; they are far too terse, and do not declare intent in any way.
Software development is not just about communicating with the computer; it’s about communicating with other engineers so we can work collaboratively. Regular expressions are the antithesis of that way of thinking.
> Every known sentient being is a finite state machine.
I know this is just a cutesy slogan, but how could you possibly know whether a living creature is a finite state machine? What would it even mean? I know I don't respond identically to identical stimuli presented on different occasions ….
Mostly, yes, but I do think there's a real point here as well.
> how could you possibly know whether a living creature is a finite state machine?
As I understand it, physicists don't really know whether the physical world has a finite number of states, or an infinite number. I think they tend to lean toward finite, though.
Even if it's infinite, I doubt it's of consequence. That is to say, I doubt that sentience depends on the physical possibility of an infinite number of states. (Of course, if it turns out the physical world only has a finite number of states, that demonstrates that sentience is compatible with the finite-states constraint.)
> What would it even mean?
Systems can be modelled as finite state machines. Sentient entities like people are extremely sophisticated systems, but that's just a matter of degree, not of category.
> I know I don't respond identically to identical stimuli presented on different occasions
Right, because you're in a different state. You'll never be in the same state twice. We don't need to resort to non-determinism.
Obnoxious, I mean, trivial, answer: Just make "occasions" a variable. Assuming your lifetime is finite, you could simply assign each point in time to a value, and there you have it: a finite mapping from each moment to a state.
For example, <http://example.com./>, <http:///example.com/> and <https://en.wikipedia.org/wiki/Space (punctuation)> are classified as invalid URLs in the blog, but they are accepted in the browser.
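A quick way to see this in a browser console or Node (per the WHATWG URL spec all three should parse: extra slashes after a special scheme are skipped, trailing-dot hosts are allowed, and the space just gets percent-encoded):

    const samples = [
      "http://example.com./",
      "http:///example.com/",
      "https://en.wikipedia.org/wiki/Space (punctuation)",
    ];
    for (const s of samples) {
      try {
        console.log("accepted:", new URL(s).href);
      } catch {
        console.log("rejected:", s);
      }
    }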
As the creator of cURL puts it, there is no URL standard[3].
[1]: https://www.ietf.org/rfc/rfc3986.txt
[2]: https://www.ietf.org/rfc/rfc3987.txt
[3]: https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/