1. The URL is parsed into components.
2. The domain name is encoded using punycode.
3. The URL path is encoded as UTF-8 and then percent-escaped.
4. The query string and fragment are encoded using the page encoding and then percent-escaped. For URLs entered into the address bar, UTF-8 seems to be used rather than the page encoding.
5. The result is joined back into a single byte string and sent to the server.
So this means parts of a single URL are encoded using three different encodings, and to send a URL correctly (i.e. to send it the way browsers do) one needs to know the encoding of the page the URL was extracted from. I haven't seen this algorithm stated explicitly anywhere, but this is how all browsers work. It doesn't make any sense.
a = document.createElement('a')
a.href = "/ö/o?ö/o#ö/o"
console.log(a.hash, a.pathname, a.search)
Browser   .hash        .pathname    .search
Chrome    #ö/o         /%C3%B6/o    ?%C3%B6/o
Firefox   #%C3%B6/o    /%C3%B6/o    ?%C3%B6/o
Safari    #%C3%B6/o    /%C3%B6/o    ?%C3%B6/o
Edge      #ö/o         /ö/o         ?ö/o
IE11      #ö/o         ö/o (ouch)   ?ö/o
It's a big mess.
I discovered this while working on a SPA router that can handle routes in either `pathname`, `search` or `hash` mode... The routes are defined as the keys of a JS object, in Unicode in the source, and we have to juggle back and forth to match what `document.location.*` returns.
The use of the page encoding is part of the browser behavior, not any RFC. It has been retroactively specified in https://url.spec.whatwg.org/#url-parsing
Maybe I'm misunderstanding, but how could a browser use page-encoding before the page has been fetched?
a) The URL is entered into the address bar. The default encoding (apparently UTF-8) is used for the query and fragment.
b) The user clicks on a link. The page URL is used; the page encoding is already known.
c) A redirect to a new URL happens. I can't say off the top of my head what happens in this case.
You're saying if I click on a link with an IRI in a non-UTF8 html page, it means one thing.
But if I copy and paste that link into a text file, and later paste it into a browser, it'll mean something else entirely? Or if I save it to a bookmark using some service?
Well, it doesn't sound so shocking that links appearing in a page are interpreted using the page encoding. That doesn't explain why the path would use UTF-8 regardless of page encoding, though.
Are you sure URLs directly entered in the address bar use UTF-8 for query strings and fragments rather than system encoding? If my OS is set to use GB2312, how do URL fragments that I type into the browser get encoded?
This question was not relevant for Scrapy, because when the page encoding is unknown it makes sense to default to UTF-8 regardless of what browsers do; unlike the encoding of URLs extracted from HTML, this UTF-8 default doesn't lead to incompatible behavior.
The first thing to be added here, I expect, was the handling of the query string. This was needed to allow non-ASCII things to be submitted via forms; the need for this arose before there was any support for non-ASCII domain names, and likely before there was any real need for non-ASCII paths.
The issue with forms as they were initially created is that they would submit to some server and the server would then process the query string. In doing so it would do whatever it did for non-ASCII stuff; there was no standard. In practice, it would percent-unescape and then treat the bytes as being in whatever encoding the server developer happened to default to. Typically this was the encoding the web page was in as well. Yes, this is totally busted if your form has a name field and the name being input is not representable in ISO-8859-1 or whatever you authored your page in, but this was an incredibly common way to handle non-ASCII in the 90s.

This all predates my involvement in browsers, so I'm not sure which is the chicken and which is the egg here, but the upshot was that browsers started sending the query string in the page encoding (with various hacks that were not interoperable for a while in cases when the text was not representable in that encoding) and servers started depending on this behavior. Then the accept-charset attribute got added to <form> to allow overriding the default behavior for people who wanted something else.

The behavior of form submission via the query string in terms of encodings is specified at https://html.spec.whatwg.org/multipage/forms.html#picking-an... and https://url.spec.whatwg.org/#concept-urlencoded-serializer and https://html.spec.whatwg.org/multipage/forms.html#url-encode... which bridges them.
I think the next step was URL paths; for those, browsers were inconsistent for a while about whether UTF-8 or the page encoding was used, but eventually all aligned on UTF-8, and this got standardized in the HTML spec as well, I'm fairly certain.
And for hostnames, punycode encoding was more or less a must for doing DNS resolution. And at that point that ended up getting sent on the wire as well. https://tools.ietf.org/html/rfc2616#section-14.23 defines the Host header, which is presumably what we're talking about here, since that's the only way the hostname is sent to the server (note that this is NOT part of the same byte string as the path and query). Following the breadcrumbs for that through https://tools.ietf.org/html/rfc2396 it all talks about this header sending the DNS name, which is why I assume browsers settled on sending punycode here. %-encoding in hostnames wasn't really a thing for a while, I think; e.g. Firefox didn't even support it until about 6 months ago.
> The [WHATWG] spec says so because browsers have implemented the spec.
But that's exactly wrong. The WHATWG spec is based on the observed behaviour of browsers. By documenting what existing browsers do, and scoping out who is prepared to change their implementation when there are differences, it tries to specify a behaviour that is either already implemented or able to be implemented with as few changes as possible. It is possible to specify behaviour that is stricter than any current implementation, but the burden of proof for doing so is non-trivial; you have to demonstrate, to the satisfaction of all involved parties, that the proposed change will not break enough web content to cause a compatibility problem. Doing that is non-trivial, but possible if you are sufficiently motivated.
The complaint that the WHATWG spec is written by a closed group also seems like a misunderstanding. Compared to pay-to-play organisations like the W3C, or pay-to-meet organisations like the IETF, there is no concept of formal membership, or need to pay to get your voice heard. There's a mailing list, an open GitHub repository and an IRC channel. Anyone, especially prospective implementors of the spec, is welcome to participate in those forums, and they will have the same privileges as any other contributor. I imagine that for URL in particular, contributions from the curl community would be most welcome.
I'm not clear though why the effective starting point is the superset of all variants currently implemented by one of the involved parties. Given that the extant standards are quite strict, couldn't an argument be made that the "burden of proof" would be on why these needed to be extended, on a case by case basis?
Regardless of the particular agreements reached, an up-to-date standard will certainly benefit everyone; it seems like it would be an excellent outcome if the standard came with a reference implementation too :-)
In fact the process is iterative. When you find that implementations do different things, you make an educated guess about what's most likely to be compatible with existing content. For example if all browsers allow an arbitrary number of slashes between the scheme and the host part of a URL, it seems very likely that some existing content relies on that, and very unlikely that you are going to get all the browsers to change their implementation, so you standardise that behaviour. On the other hand if you find that, say, one browser allows a semicolon rather than a colon after the scheme, but no other implementation does, it's very unlikely that is required for compatibility, so you typically don't allow semicolon-after-scheme in the spec, and speak to the existing implementation about fixing their bug.
The hard case is where there isn't a clear consensus about the required behaviour, especially when implementations are flatly contradictory. In that case you speak to people and find out whether they have some evidence that their current behaviour has good effects, and whether they are willing to change. Armed with this, and your ability to reason about what should be more compatible or otherwise desirable, you pick one behaviour and hope that people converge on it. Often, in these cases, time will pass and it will be clear that you made the wrong choice; in that case you go back and edit the spec to reflect whatever reality turned out to be.
So the process is iterative and not perfectly algorithmic; the person doing the work has to apply their judgement to make decisions where there isn't a clear path forward. But the goal, at least, is clear; it's to have a single specification that is in-line with what's actually implemented in the real world, and what's needed to consume real-world content. In this sense the URL spec is already a success; for example it was used as the basis for the Rust url library, which is believed to be compatible enough to consider using not just in Servo, but also in Firefox in place of the legacy C++ parser.
Also, I hope they acted on some real-world statistics, provided by Google's or other crawlers...
There are surely arguments for allowing any old garbage in URLs/URIs/IRIs - hey, it's easy and fun! - but, gee, it's not like bunging an identifier through a URI escape function is going to triple the code base.
Compare that with the surface area for potential bugs in the parsing code which somehow didn't account for several megabytes of whitespace embedded in a URL or multiple code pages in a single string or whatever zalgoesque horror is showing up on the security bulletins this week. It's a lot easier to proof code that only needs to deal with a strict subset of 7-bit clean ASCII and can politely decline embedded emojis.
When one web client starts being overly zealous in what it accepts, that puts an implicit onus on everybody else to start accepting that too ("all browsers do"!). Where would you draw the line? Well, we've got a couple of RFCs lying around, how about we go with those.
This stuff doesn't have to be hard. Surely the act of issuing an HTTP redirect isn't the Last Great Unsolved Problem of web engineering! I say rejoice in the beauty of URIs in canonical form. Be miserly in what you accept for a change.
The IRI RFC does however specify that for transmission/interchange the IRIs need to be percent-encoded to form valid URIs. That's pretty straightforward! A good user agent will fully transcode whatever characters I type in the address bar, generating a valid IRI. A bad user agent will complain and say that whatever it was I typed was very nice but not a valid interweb, and I have a bad day. A broken user agent may generate some broken encoding (or just pass on whatever I typed in UTF-8 or CP-1252), and some web admins may have a bad day, if they weren't already.
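JavaScript's `encodeURI` is roughly that IRI-to-URI mapping for everything except the hostname: non-ASCII characters are UTF-8 encoded and percent-escaped, while reserved URI delimiters are left alone. A sketch:

```javascript
// Approximate IRI -> URI mapping: UTF-8 + percent-escaping of non-ASCII,
// reserved delimiters (: / ? #) untouched. Caveat: the hostname is NOT
// converted to punycode here; that step needs IDNA (see the URL class).
console.log(encodeURI("http://example.com/ö/path?q=ö#ö"));
// http://example.com/%C3%B6/path?q=%C3%B6#%C3%B6
```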
But the location bar is not the only place that URLs come from (less and less every day). When a web application (e.g. 301 redirect) or a resource (e.g. external document reference) uses an RFC-3987 noncompliant identifier, just how far should the user-agent bend over backwards to fetch the resource? Some browsers seem to do a lot, but it's not clearly documented just how much (see elsewhere in the comments), and among the article's laments was the observation that when any browser accepts dodgy identifiers, it places a burden on everyone else to accept the same.
Seems that WHATWG hopes to formally codify just how dodgy you can get away with without the identifier being "too dodgy", but I get the impression that this will be an expansive definition ("hey, if they did it, you can do it! we believe in you"). This is an understandable approach, as they are trying to document the state of the web, but it does put extra burden onto anyone dealing methodically with URLs henceforth, rather than being strict and drawing some clear boundaries up for people constructing URLs. I'm expecting to see a lot of "SHOULD" and not a lot of "MUST NOT" :-/
a) has to be usable by users with the lowest common denominator of tech savvy and
b) doesn't have to parse URLs out of other text because it encounters URLs only in defined fields
As a result, this software is written to be lenient about trying to interpret URLs input by the user. So it already has to have facilities to transform URLs with improperly encoded special characters such as spaces. This allows it to in turn be more lenient about what it accepts from servers.
With no clear standard for these transformations and the browser makers not sharing their parsing and transformation libraries, the writers of other software that has to deal with URLs are left playing catch-up and trying to reverse-engineer the browsers' undocumented hacks for dealing with not-very-well-formed URLs.
Moreover, many have the additional burden of having to be able to parse URLs out of other text where the delineation of URL start/end may not be clear. Some of the forms supported by browser makers (who don't have to do this) can't reasonably be supported in this context, leading to frustration for developers (let alone users).
Either you have a valid scheme or you don't, and either you have a relative path or an absolute path.
A double slash after a valid scheme implies an authority; if a third slash exists, it's an absolute path, and anything past the third slash has to be normalized. So http:////// is actually http:///
If you have an invalid scheme or a relative path, such as http:/ or +http://, it would actually be normalized to ./http:/ and ./+http:/ respectively.
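The slash-collapsing behaviour is easy to check against the WHATWG parser (Node's `URL` class); a sketch, keeping a real host so the URL stays valid:

```javascript
// The WHATWG parser skips any run of slashes (and backslashes) between
// a special scheme and the authority, so these parse to the same URL.
console.log(new URL("http://example.com/a").href);     // http://example.com/a
console.log(new URL("http://////example.com/a").href); // http://example.com/a
```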
I don't think cURL implements this percent-encoding yet; instead, it sends the path bytes out as-is, which on a Linux system with a UTF-8 locale means raw UTF-8 on the wire. This was recently added to cURL's TODO though: https://curl.haxx.se/docs/todo.html#Add_support_for_IRIs
> The term URL was later changed to become URI, Uniform Resource Identifiers
> A URI can be further classified as a locator, a name, or both. The
> term "Uniform Resource Locator" (URL) refers to the subset of URIs
> that, in addition to identifying a resource, provide a means of
> locating the resource by describing its primary access mechanism
Personally, I don't treat errors like that as massively damaging to the credibility of the rest of the article, but it is rather jarring, and I know that people do post "Stopped reading at..." comments about this sort of thing.
I was around then, and definitely recall the conventional wisdom being "don't call them URLs anymore, call them URIs, that's the new standard." At any rate, RFC 3986 claimed to be a new standard for _syntax_ -- according to the passage you quote, whether something is a "URL" or a non-URL URI is not a matter of syntax at all. The syntax spec is now called "URI". That RFC claims, anyway.
Except somehow the result has been massive confusion over which standard should be used for the syntax of URLs/URIs, with the WHATWG just stirring the pot further (obligatory reference to the xkcd about there being too many standards, so let's create another). I don't know that there's anything wrong or to blame in the written text of RFC 3986, but something in the social practice of RFC standards as manifested in URL/URI-related things has resulted in massive confusion, and in a situation described in the OP where there is effectively no formal standard for URLs/URIs consistently used across software.
- Somewhere within some unstructured text: Here, finding the exact end of the URL is difficult and disallowing spaces can have a great benefit. Even then you can have ambiguities: Does punctuation at the end belong to the URL? What about URLs that are wrapped in <> brackets?
Whether or not it's even possible that the URL contains non-ASCII characters depends on the type of the document.
- As a value of a HTML attribute without surrounding quotes: Slightly less ambiguous but including a space would break the HTML parser. The parser works fine with unicode characters however, so there is less pressure for authors to encode them.
- As a value of a quoted HTML or XML attribute, HTTP header, or data field: Charset, start, and end of the URL are communicated out-of-band, so there is no pressure at all to normalize the URL.
- In the browser address bar: Start and end are unambiguous but there are other concerns: You have to detect if it's an actual URL or a search term, you have to deal with omitted parts (e.g. no scheme) and people WILL input unicode characters. You also want to display unicode characters to the user, except when you don't want to for security reasons. And then users will copy the URL in the address bar and paste them into an HTML document...
Sometimes the guff from the human looks exactly the same as a URL, sometimes it's close enough to work out what they meant, and the rest of the time it's just guff. If it looks exactly like a URL or IRI, it's trivial for the browser to turn it into a proper URI and emit an appropriate request. If it's close, then some assumptions can be made, but the browser is still responsible for constructing the actual URI. That doesn't mean that what the human typed has any bearing on the underlying protocols or standards, or the definition of a URI/URL.
(and yeah, file: URLs must have three, but let's ignore that for now)
Actually there is one exception, but let's just ignore this wildly "unused" operating system, which doesn't seem to care about POSIX and RFCs anyway.
A file URL takes the form:

    file://<host>/<path>

where <host> is the fully qualified domain name of the system on
which the <path> is accessible, and <path> is a hierarchical
directory path of the form <directory>/<directory>/.../<name>.

... elements may be preceded with <n>* to designate n or more
repetitions of the following element; n defaults to 0.
fileurl = "file://" [ host | "localhost" ] "/" fpath
fpath = fsegment *[ "/" fsegment ]
fsegment = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
This is also why Firefox insists on five consecutive slashes when encoding a UNC string (a Windows/Samba share path, like "\\server\share\path\to\file.txt") into a URL (e.g. "file://///server/share/path/to/file.txt").
And let us not even begin on the fact that file: URLs are specified in an "obsolete" spec, and explicitly allow characters in the path that are disallowed by RFC 3986. I'm working on fixing that [https://tools.ietf.org/html/draft-ietf-appsawg-file-scheme] but it takes time.
That name alone should tell you that they don't give a hoot about anything other than WWW and HTTP.