
URL Parsing in WebKit - ash_gti
https://webkit.org/blog/7086/url-parsing-in-webkit/
======
treve
For what it's worth, the file:///foo/bar url means:

1\. an empty hostname

2\. path = '/foo/bar'

In a file:// URL, an empty hostname implies 'localhost'. The file:// scheme
is, as far as I know, unique in this regard.

When you parse URIs generically and encounter a URI like:

scheme:foo/bar

the meaning of 'foo/bar' is also a path. There are many URI schemes like
this, such as mailto: and urn:.

So given that, file:foo/bar should probably throw an error, but if not, it
should be canonicalized into file:///foo/bar (triple slash), because
file://foo/bar refers to the 'bar' file on the 'foo' host, and file:///foo/bar
refers to /foo/bar on localhost.
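
To make this concrete, here is what the WHATWG parser that ships in browsers and Node does with these three forms (a quick sketch; the canonicalization on the last line is the WHATWG behavior, not something RFC 8089 mandates):

```ts
// Node or any modern browser: URL is a global (WHATWG URL Standard).

// Two slashes: 'foo' is parsed as the host, '/bar' as the path.
console.log(new URL("file://foo/bar").host);      // "foo"
console.log(new URL("file://foo/bar").pathname);  // "/bar"

// Three slashes: empty host (implied localhost), '/foo/bar' as the path.
console.log(new URL("file:///foo/bar").host);     // ""
console.log(new URL("file:///foo/bar").pathname); // "/foo/bar"

// No slashes: rather than throwing an error, the WHATWG parser
// canonicalizes to the triple-slash form, as suggested above.
console.log(new URL("file:foo/bar").href);        // "file:///foo/bar"
```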

~~~
derefr
One annoying thing: web browsers parse URLs two different ways!

1\. Typing "example.com" into your address bar takes you to "//example.com/"
(root path of the example.com domain in the current/default scheme).

2\. Clicking on a link with href="example.com" takes you to
"///./example.com" (that is: current domain, current scheme, relative path
reference).

The URL parsers built into most runtimes use behavior #2, and I can see the
usefulness of it in the sense that you can sort of treat a URL as an extended
version of a filesystem Path object, where any string that forms a valid Path
(relative or absolute) also forms a valid URL with equivalent semantics.
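
For instance, with the WHATWG URL constructor (the base URL here is made up for illustration):

```ts
// Behavior #2: relative resolution against a base, like a filesystem path.
const base = "https://current.example/dir/page.html";

console.log(new URL("example.com", base).href);
// "https://current.example/dir/example.com"  -- treated as a relative path

console.log(new URL("/example.com", base).href);
// "https://current.example/example.com"      -- absolute path, same origin

console.log(new URL("//example.com", base).href);
// "https://example.com/"                     -- protocol-relative: new host
```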

But most URLs people _type_, in the wild, implicitly assume behavior #1. If
you write an unstructured-text "ingestor" that extracts URLs embedded in
plaintext on the web or in print and tries to dereference them, only
approach #1 will get you anywhere.

That said, I've never seen a single URL library that exposes any parsing API
for type-1 URL fragments. It'd be _extremely_ useful for parsing URLs entered
by humans as responses to prompts.
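
Something like the following is what you end up hand-rolling every time; it's only a sketch (the https default and the scheme regex are my own assumptions, and input like "host:8080" is inherently ambiguous with a scheme):

```ts
// Sketch of a "type-1" parser for human-typed URLs: if the string doesn't
// carry an explicit scheme, assume it starts with a hostname and prepend
// a default scheme before handing it to the real parser.
function parseTyped(input: string, defaultScheme = "https"): URL | null {
  const trimmed = input.trim();
  // Explicit scheme (e.g. "https://", "mailto:"): parse as-is. Note that
  // "example.com:8080" also matches, since '.' is legal in a scheme.
  if (/^[a-zA-Z][a-zA-Z0-9+.-]*:/.test(trimmed)) {
    try { return new URL(trimmed); } catch { return null; }
  }
  // Otherwise treat the first segment as a hostname (behavior #1).
  try { return new URL(`${defaultScheme}://${trimmed}`); } catch { return null; }
}

console.log(parseTyped("example.com")?.href);          // "https://example.com/"
console.log(parseTyped("example.com/a?b=c")?.href);    // "https://example.com/a?b=c"
console.log(parseTyped("mailto:x@example.com")?.href); // "mailto:x@example.com"
```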

~~~
greglindahl
Google around for a "public suffix list" library for your favorite language;
it will help you guess that 'example.com' could be a valid FQDN. As a
crawler guy, I see public suffix list features in the kinds of URL libraries
that I use or write.
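
The core idea is simple; here's a toy sketch (the three-entry suffix set stands in for the real list from publicsuffix.org, and real libraries also handle wildcard and exception rules):

```ts
// Toy public-suffix check: a real implementation loads the full list
// from https://publicsuffix.org/list/ and refreshes it regularly.
const suffixes = new Set(["com", "co.uk", "org"]);

// A candidate looks like a registrable domain if some proper suffix of
// its labels is a known public suffix.
function looksLikeDomain(candidate: string): boolean {
  const labels = candidate.toLowerCase().split(".");
  for (let i = 1; i < labels.length; i++) {
    if (suffixes.has(labels.slice(i).join("."))) return true;
  }
  return false;
}

console.log(looksLikeDomain("example.com"));   // true
console.log(looksLikeDomain("example.co.uk")); // true
console.log(looksLikeDomain("example.local")); // false -- not on the list
```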

~~~
tokenizerrr
Don't forget to update your library whenever a new gTLD comes out

~~~
greglindahl
It appears that most libraries download the list from Mozilla, but yes, you do
need to ensure that it's downloaded frequently.

------
jacksonsabey
I have an implementation, although it's currently closed source and is only
available via API: [http://0ut.ca/documentation](http://0ut.ca/documentation)

I believe it's the closest to the standard that I've found, and if it isn't,
I would like to correct that.

There is a Strict parser, which fails on any error, and a Loose parser,
which discards errors when possible and follows the de facto parsing
implementations.

It should be able to handle any of the edge cases, such as partially
percent-encoded Unicode, invalid characters, normalization, or octal/hex
IPv4 addresses. The only thing from your linked unit tests that it will not
handle is | and \ for Windows paths; those will be encoded.

If anyone is interested in seeing how the parsing is done, you can compare
the expected output in your browser here:
[http://0ut.ca/api;v1.0/validate/uri/after?hTtPs://foo:%F0%9F...](http://0ut.ca/api;v1.0/validate/uri/after?hTtPs://foo:%F0%9F%92%A9@0xC0.250.01:80/foo/../../../bar//?a=b&c=d)

You can also try validating strange relative URIs:
[http://0ut.ca/api;v1.0/validate/uri/after?+invalid-scheme:/p...](http://0ut.ca/api;v1.0/validate/uri/after?+invalid-scheme:/path)

I would be happy to explain any of the reasoning behind the parsing if anyone
is interested.
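
For comparison, here is what the WHATWG parser in a browser does with the first example URL above (a sketch of browser behavior, not necessarily what the 0ut.ca parser returns):

```ts
// WHATWG IPv4 host parsing: 0x-prefixed parts are hex, 0-prefixed parts
// are octal, and the last part fills the remaining bytes of the address.
const u = new URL(
  "hTtPs://foo:%F0%9F%92%A9@0xC0.250.01:80/foo/../../../bar//?a=b&c=d"
);

console.log(u.protocol); // "https:"      -- scheme lowercased
console.log(u.hostname); // "192.250.0.1" -- 0xC0 = 192, 01 = octal 1
console.log(u.port);     // "80"          -- kept, not the https default
console.log(u.pathname); // "/bar//"      -- '..' segments resolved
```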

~~~
watersb
Wow, thanks!

Your tool helps me because it's like an EXAMPLES section of a man page.

------
SloopJon
The problem of inconsistent URL parsing doesn't just apply to browsers. This
story prompted me to look for "What every web developer must know about URL
encoding," which was posted a while back:

[https://news.ycombinator.com/item?id=5930494](https://news.ycombinator.com/item?id=5930494)

Unfortunately, the blog post now returns a 404. Here's the most recent
capture from archive.org:

[https://web.archive.org/web/20151229061347/http://blog.lunat...](https://web.archive.org/web/20151229061347/http://blog.lunatech.com/2009/02/03/what-every-web-developer-must-know-about-url-encoding)

~~~
allending
Slightly better link:
[https://www.talisman.org/~erlkonig/misc/lunatech%5Ewhat-ever...](https://www.talisman.org/~erlkonig/misc/lunatech%5Ewhat-every-webdev-must-know-about-url-encoding/)

------
thaumaturgy
This is a problem that's near and dear to my heart, and more progress on
standardizing URL parsing would be lovely. I have a crawler and a data miner
that both rely on URL parsing, and it's kind of a pain. There are about a
hundred tests that the code needs to pass with each revision, and that's
still only a fraction of the tests that are needed; edge cases keep turning
up in the wild.

For instance, telling the difference between a web URI and a mailto URI
without the benefit of a scheme at the beginning of the URI is total
guesswork.

The parser's current approach is to return parsed URIs along with a
confidence percentage, and the application logic then tries to make some
additional guesses based on context.
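
As a toy illustration of that shape of API (the signals and confidence numbers here are invented for the example, not the parser's actual heuristics):

```ts
// Sketch: classify a scheme-less string as web vs. mail, with a
// confidence score the caller can combine with surrounding context.
type Guess = { uri: string; confidence: number };

function guessSchemeless(s: string): Guess {
  const hasAt = s.includes("@");
  const hasSlash = s.includes("/");
  if (hasAt && !hasSlash) {
    // "user@example.com" is almost always mail, but it could also be
    // userinfo@host, hence less than full confidence.
    return { uri: `mailto:${s}`, confidence: 0.8 };
  }
  return { uri: `https://${s}`, confidence: hasAt ? 0.5 : 0.9 };
}

console.log(guessSchemeless("bob@example.com"));   // mailto:..., 0.8
console.log(guessSchemeless("example.com/about")); // https://..., 0.9
```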

URI parsing is not my favorite thing.

------
daurnimator
See also [https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/](https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/)

------
userbinator
_An ideal benchmark would measure performance of parsing real URLs from
popular websites, but publishing such benchmarks is problematic because URLs
often contain personally identifiable information_

You could base your benchmark on URLs obtained by crawling the public web
(without cookies or other state).

------
witty_username
> For example, you might be trying to reduce your server’s bandwidth use by
> removing unnecessary characters in URLs.

Is this sarcasm? The savings can never be more than a few KB per page: even
a page with a hundred links, each trimmed by ten characters, only saves
about 1 KB.

------
rtpg
We standardised HTML parsing; is URL parsing really much trickier?

Maybe we need HTML6, with an opt-in doctype and every parser fully defined.

~~~
igt0
The standardization is "easy"; the hard part is that browsers must keep
supporting the wrong behavior, because a bunch of applications expect it.
Otherwise, implementing the "standardised" URL parsing would break the web.

