My url isn't your url (haxx.se)
217 points by dosshell on May 11, 2016 | hide | past | favorite | 43 comments

It seems nobody really understands how to handle non-ASCII URLs. To get it right for Scrapy (i.e. to do the same as browsers) we checked how several browsers behave; the results were consistent but surprising, if not crazy. Before sending a URL to a server (e.g. when the user clicks a link), a browser does the following:

1. The URL is parsed into components.

2. The domain name is encoded using Punycode.

3. The URL path is encoded to UTF-8 and then percent-escaped.

4. The query string and fragment are encoded using the page encoding and then percent-escaped. For URLs entered into the address bar it seems that UTF-8 is used, not the page encoding.

5. The result is joined back into a single byte string and sent to the server.

So this means parts of a single URL are encoded using three different encodings, and one needs to know the encoding of the page a URL was extracted from to send it correctly (i.e. to send it the way browsers do). I haven't seen this algorithm stated explicitly anywhere, but this is how all browsers work. It doesn't make any sense.

However, the JS getters are all over the place...

Given this:

    a = document.createElement('a')
    a.href = "/ö/o?ö/o#ö/o"
    console.log(a.hash, a.pathname, a.search)
we get:

    Browser  .hash       .pathname   .search 
    Chrome   #ö/o        /%C3%B6/o   ?%C3%B6/o  
    Firefox  #%C3%B6/o   /%C3%B6/o   ?%C3%B6/o  
    Safari   #%C3%B6/o   /%C3%B6/o   ?%C3%B6/o  
    Edge     #ö/o        /ö/o        ?ö/o  
    IE11     #ö/o        ö/o (ouch)  ?ö/o  
Likewise for domains, some UAs return them Punycoded (but don't provide a decoder), some in Unicode, and PhantomJS passes it percent encoded...

It's a big mess.

I discovered this while working on a SPA router that can handle routes in either `pathname`, `search` or `hash` mode... The routes are defined as the keys of a JS object, in Unicode in the source, and we have to juggle back and forth to match what `document.location.*` returns.
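A minimal sketch of the juggling such a router ends up doing (a hypothetical helper, not the actual router code): decode whatever `document.location.*` returns before matching it against Unicode route keys, and tolerate the UAs that return the part undecoded:

```javascript
// Normalise a location part so it can be matched against Unicode
// route keys, whichever browser produced it. Assumes the route keys
// contain no literal '%' characters.
function normalizeRoutePart(part) {
  try {
    return decodeURIComponent(part); // Firefox/Safari-style escaped form
  } catch (e) {
    return part; // malformed escape sequence: fall back to the raw string
  }
}

console.log(normalizeRoutePart('/%C3%B6/o')); // → /ö/o (escaped input)
console.log(normalizeRoutePart('/ö/o'));      // → /ö/o (already Unicode)
```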

It looks like the Developer Tools also use these JS utilities, so you can't trust the Network tab when checking what URL is sent to a remote server; to check what's actually being sent one has to either inspect the network traffic or check it server-side.

In theory, this is specified in section 3.1 of https://www.ietf.org/rfc/rfc3987.txt (including the "MAY" section beginning "Systems accepting IRIs MAY convert the ireg-name component of an IRI as follows"). That covers the distinction between the domain and the rest of the URL.

The use of the page encoding is part of the browser behavior, not any RFC. It has been retroactively specified in https://url.spec.whatwg.org/#url-parsing

Thanks for the references!

> For URLs entered to address bar it seems that UTF-8 is used, not page encoding.

Maybe I'm misunderstanding, but how could a browser use page-encoding before the page has been fetched?

It uses the encoding of the page the URL is extracted from, not of the page the URL leads to.

And how would you deal with a URL to the HTML itself?

There are three ways to get a URL:

a) The URL is entered into the address bar. The default encoding (it seems UTF-8) is used for the query and fragment.

b) The user clicks on a link. The page URL is used; the page encoding is already known.

c) A redirect to a new URL happens. I can't say off the top of my head what happens in this case.

This is just a recipe for massive confusion though.

You're saying if I click on a link with an IRI in a non-UTF8 html page, it means one thing.

But if I copy and paste that link into a text file, and later paste it into a browser, it'll mean something else entirely? Or if I save it to a bookmark using some service?

Yeah, it is a recipe for confusion. Browsers use a workaround: when you copy the URL of a link (using "Copy Link Address" in Chrome or "Copy Link Location" in Firefox) it copies the escaped URL, not the original URL.

> Query string and fragment are encoded using page encoding and then percent-escaped. For URLs entered to address bar it seems that UTF-8 is used, not page encoding.

Well, it doesn't sound so shocking for links that appear in a page to be defined using the page encoding. That doesn't explain why the path would use UTF-8 regardless of the page encoding, though.

Are you sure URLs directly entered in the address bar use UTF-8 for query strings and fragments rather than system encoding? If my OS is set to use GB2312, how do URL fragments that I type into the browser get encoded?

A good question; I'm not sure UTF-8 is always used, because the systems we tested were all using UTF-8 as the system encoding.

This question was not relevant for Scrapy, because it makes sense to use UTF-8 as the default regardless of what browsers do when the page encoding is unknown; unlike the encoding of URLs extracted from HTML, this UTF-8 default doesn't lead to incompatible behavior.

The not making sense bit is a consequence of several waves of changes, some of them not really involving standards.

The first thing to be added here, I expect, was the handling of the query string. This was needed to allow non-ASCII things to be submitted via forms; the need for this arose before there was any support for non-ASCII domain names, and likely before there was any real need for non-ascii paths.

The issue with forms as they were initially created is that they would submit to some server and the server would then process the query string. In doing so it would do whatever it did for non-ASCII stuff; there was no standard. In practice, it would percent-unescape and then treat the bytes as being in whatever encoding the server developer happened to default to. Typically this was the encoding the web page was in as well. Yes, this is totally busted if your form has a name field and the name being input is not representable in ISO-8859-1 or whatever you authored your page in, but this was an incredibly common way to handle non-ASCII in the 90s. This all predates my involvement in browsers, so I'm not sure which is the chicken and which is the egg here, but the upshot was that browsers started sending the query string in the page encoding (with various hacks that were not interoperable for a while in cases when the text was not representable in that encoding) and servers started depending on this behavior. Then the accept-charset attribute got added to <form> to allow overriding the default behavior for people who wanted something else. The behavior of form submission via the query string in terms of encodings is specified at https://html.spec.whatwg.org/multipage/forms.html#picking-an... and https://url.spec.whatwg.org/#concept-urlencoded-serializer and https://html.spec.whatwg.org/multipage/forms.html#url-encode... which bridges them.
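For the UTF-8 case of that serializer, `URLSearchParams` (available in modern browsers and Node.js) shows the resulting bytes; the legacy-page-encoding paths and the numeric-entity fallback hacks mentioned above are not covered by this sketch:

```javascript
// application/x-www-form-urlencoded serialisation, UTF-8 case only:
// each non-ASCII character becomes its UTF-8 bytes, percent-escaped,
// and spaces become '+'.
const params = new URLSearchParams({ name: 'Grüße', city: 'São Paulo' });
console.log(params.toString());
// → name=Gr%C3%BC%C3%9Fe&city=S%C3%A3o+Paulo
```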

I think the next step was URL paths; for those, browsers were inconsistent for a bit about whether UTF-8 or the page encoding was used, but eventually all aligned on UTF-8, and this got standardized in the HTML spec as well, I'm fairly certain.

And for hostnames, punycode encoding was more or less a must for doing DNS resolution. And at that point that ended up getting sent on the wire as well. https://tools.ietf.org/html/rfc2616#section-14.23 defines the Host header, which is presumably what we're talking about here, since that's the only way the hostname is sent to the server (note that this is NOT part of the same byte string as the path and query). Following the breadcrumbs for that through https://tools.ietf.org/html/rfc2396 it all talks about this header sending the DNS name, which is why I assume browsers settled on sending punycode here. %-encoding in hostnames wasn't really a thing for a while, I think; e.g. Firefox didn't even support it until about 6 months ago.

nice, this explains it!

That's some pretty useful documentation there. I hope that it at least forms part of a submission to the WHATWG folks.

This seems to fall into the common trap of misunderstanding the causality in standardising legacy features of the web platform. The post says:

> The [WHATWG] spec says so because browsers have implemented the spec.

But that's exactly wrong. The WHATWG spec is based on the observed behaviour of browsers. By documenting what existing browsers do, and scoping out who is prepared to change their implementation when there are differences, it tries to specify a behaviour that is either already implemented or able to be implemented with as few changes as possible. It is possible to specify behaviour that is stricter than any current implementation, but the burden of proof for doing so is non-trivial; you have to demonstrate, to the satisfaction of all involved parties, that the proposed change will not break enough web content to cause a compatibility problem. Doing that is non-trivial, but possible if you are sufficiently motivated.

The complaint that the WHATWG spec is written by a closed group also seems like a misunderstanding. Compared to pay-to-play organisations like the W3C, or pay-to-meet organisations like the IETF, there is no concept of formal membership, or need to pay to get your voice heard. There's a mailing list, an open GitHub repository and an IRC channel. Anyone, especially prospective implementors of the spec, is welcome to participate in those forums, and they will have the same privileges as any other contributor. I imagine that for URL in particular, contributions from the curl community would be most welcome.

It is good to have clarity on how the WHATWG is working. Thank you! Sounds like an HTML5-style treatment of the problem.

I'm not clear though why the effective starting point is the superset of all variants currently implemented by one of the involved parties. Given that the extant standards are quite strict, couldn't an argument be made that the "burden of proof" would be on why these needed to be extended, on a case by case basis?

Regardless of the particular agreements reached, an up-to-date standard will certainly benefit everyone; it seems like it would be an excellent outcome if the standard came with a reference implementation too :-)

It's not exactly the case that it's the superset of all variants currently implemented. In fact that doesn't really make sense; two implementations can just have contradictory behaviour so there is no superset (well, unless you allowed things like picking one of the variants at random, I suppose ;)

In fact the process is iterative. When you find that implementations do different things, you make an educated guess about what's most likely to be compatible with existing content. For example, if all browsers allow an arbitrary number of slashes between the scheme and the host part of a URL, it seems very likely that some existing content relies on that, and very unlikely that you are going to get all the browsers to change their implementation, so you standardise that behaviour. On the other hand, if you find that, say, one browser allows a semicolon rather than a colon after the scheme, but no other implementation does, it's very unlikely that is required for compatibility, so you typically don't allow semicolon-after-scheme in the spec, and speak to the existing implementation about fixing their bug.

The hard case is where there isn't a clear consensus about the required behaviour, especially when implementations are flatly contradictory. In that case you speak to people and find out whether they have some evidence that their current behaviour has good effects, and whether they are willing to change. Armed with this, and your ability to reason about what should be more compatible or otherwise desirable, you pick one behaviour and hope that people converge on it. Often, in these cases, time will pass and it will be clear that you made the wrong choice; in that case you go back and edit the spec to reflect whatever reality turned out to be.

So the process is iterative and not perfectly algorithmic; the person doing the work has to apply their judgement to make decisions where there isn't a clear path forward. But the goal, at least, is clear; it's to have a single specification that is in-line with what's actually implemented in the real world, and what's needed to consume real-world content. In this sense the URL spec is already a success; for example it was used as the basis for the Rust url library, which is believed to be compatible enough to consider using not just in Servo, but also in Firefox in place of the legacy C++ parser.

Browser vendors will be hesitant to remove any special variant that might break existing websites.

The number of websites with this issue is so small that in my 9 years of crawling the web, I'd never noticed the issue.

Why do you think you would notice it? I imagine the only giveaway would be the URL in the status bar when you hover the mouse over a link. If you are a programmer (or a very pedantic person) you might notice it; otherwise probably not.

Also, I hope they acted on some real-world statistics, provided by Google's or other crawlers...

I was the CTO of a web-scale search engine, and now I work on the Internet Archive's Wayback Machine. I'm not referring to things I noticed as a web end-user.

When the robustness principle meets the tragedy of the commons, good people get driven to the brink of madness.

There are surely arguments for allowing any old garbage in URLs/URIs/IRIs - hey, it's easy and fun! - but, gee, it's not like bunging an identifier through a URI escape function is going to triple the code base.

Compare that with the surface area for potential bugs in the parsing code which somehow didn't account for several megabytes of whitespace embedded in a URL or multiple code pages in a single string or whatever zalgoesque horror is showing up on the security bulletins this week. It's a lot easier to proof code that only needs to deal with a strict subset of 7-bit clean ASCII and can politely decline embedded emojis.

When one web client starts being overly zealous in what it accepts, that puts an implicit onus on everybody else to start accepting that too ("all browsers do"!). Where would you draw the line? Well, we've got a couple of RFCs lying around, how about we go with those.

This stuff doesn't have to be hard. Surely the act of issuing an HTTP redirect isn't the Last Great Unsolved Problem of web engineering! I say rejoice in the beauty of URIs in canonical form. Be miserly in what you accept for a change.

Most of the world is outside the (English-speaking) United States of America and doesn't speak an ASCII language. That's why Unicode exists - couldn't we all just use it? That's what the IRI RFC says in essence. EDIT: Perhaps I misread and you're saying we need two layers like the RFCs intended it: the user-visible Unicode IRIs and the protocol-level ASCII URIs. cURL lives between these two and arguably would be simpler if there was no such separation and everything was Unicode to begin with.

Agreed. I love me all the Unicodes, all the time! For typing, for display, for whatever internal app logic, variable names, the names of children etc.

The IRI RFC does however specify that for transmission/interchange the IRIs need to be percent-encoded to form valid URIs. That's pretty straightforward! A good user agent will fully transcode whatever characters I type in the address bar, generating a valid IRI. A bad user agent will complain and say that whatever it was I typed was very nice but not a valid interweb, and I have a bad day. A broken user agent may generate some broken encoding (or just pass on whatever I typed in UTF-8 or CP-1252), and some web admins may have a bad day, if they weren't already.

But the location bar is not the only place that URLs come from (less and less every day). When a web application (e.g. 301 redirect) or a resource (e.g. external document reference) uses an RFC-3987 noncompliant identifier, just how far should the user-agent bend over backwards to fetch the resource? Some browsers seem to do a lot, but it's not clearly documented just how much (see elsewhere in the comments), and among the article's laments was the observation that when any browser accepts dodgy identifiers, it places a burden on everyone else to accept the same.

Seems that WHATWG hopes to formally codify just how dodgy you can get away with without the identifier being "too dodgy", but I get the impression that this will be an expansive definition ("hey, if they did it, you can do it! we believe in you"). This is an understandable approach, as they are trying to document the state of the web, but it does put extra burden onto anyone dealing methodically with URLs henceforth, rather than being strict and drawing some clear boundaries up for people constructing URLs. I'm expecting to see a lot of "SHOULD" and not a lot of "MUST NOT" :-/

Sure! Which Unicode? UTF-8? UTF-16? WTF-8, to allow for robustness in parsing?

Mildly amused that the website of the guy who makes cURL wasn't ready for the HN hug of death.

It seems to me that much of the article can be boiled down to the fact that the standards are being set by people writing software that

a) has to be usable by users with the lowest common denominator of tech savvy and

b) doesn't have to parse URLs out of other text because it encounters URLs only in defined fields

As a result, this software is written to be lenient about trying to interpret URLs input by the user. So it already has to have facilities to transform URLs with improperly encoded special characters such as spaces. This allows it to in turn be more lenient about what it accepts from servers.

With no clear standard for these transformations and the browser makers not sharing their parsing and transformation libraries, the writers of other software that has to deal with URLs are left playing catch-up and trying to reverse-engineer the browsers' undocumented hacks for dealing with not-very-well-formed URLs.

Moreover, many have the additional burden of having to be able to parse URLs out of other text where the delineation of URL start/end may not be clear. Some of the forms supported by browser makers (who don't have to do this) can't reasonably be supported in this context, leading to frustration for developers (let alone users).

There's already a solution in the spec for the slash issue.

Either you have a valid scheme or you don't, and either you have a relative path or an absolute path.

A double slash after a valid scheme implies an authority; if a third slash exists it's an absolute path, and anything past the third slash has to be normalized. So, http:////// is actually http:///

If you have an invalid scheme or a relative path, such as http:/ or +http://, it would actually be normalized to ./http:/ or ./+http:/
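The slash-skipping behaviour for valid special schemes can be checked against a WHATWG-conformant parser (Node's `URL` class here); note this sketch only covers the extra-slash case, not the invalid-scheme normalization:

```javascript
// WHATWG URL parsers skip any run of slashes between a special
// scheme and the authority, so extra slashes collapse away:
console.log(new URL('http:////example.com/a').href);
// → http://example.com/a
```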


curl gets confused by non-ASCII letters in the path part but percent encodes such byte values in the outgoing requests – which causes “interesting” side-effects when the non-ASCII characters are provided in other encodings than UTF-8 which for example is standard on Windows…

I don't think curl implements this percent encoding yet; instead, it sends the raw path bytes out as-is (on Linux with a UTF-8 locale that means raw UTF-8 on the wire). This was recently added to curl's TODO though: https://curl.haxx.se/docs/todo.html#Add_support_for_IRIs


> The term URL was later changed to become URI, Uniform Resource Identifiers


> A URI can be further classified as a locator, a name, or both. The term "Uniform Resource Locator" (URL) refers to the subset of URIs that, in addition to identifying a resource, provide a means of locating the resource by describing its primary access mechanism

Personally, I don't treat errors like that as massively damaging to the credibility of the rest of the article, but it is rather jarring, and I know that people do post "Stopped reading at..." comment about this sort of thing.

Eh, I think your comment only confirms the main point of the article, that it's become massively confusing.

I was around then, and definitely recall the conventional wisdom being "don't call them URLs anymore, call them URIs, that's the new standard." At any rate, RFC 3986 claimed to be a new standard for _syntax_ -- according to the passage you quote, whether it's a "URL" or a non-URL URI is not a matter of syntax at all. The syntax spec is now called "URI". That RFC claims.

Except somehow the result has been just massive confusion over what standard should be used how for syntax of URLs/URIs, with the WHATWG just stirring the pot further (obligatory reference to xkcd there's too many standards let's create another). I don't know that there's anything wrong or to blame in the written text of rfc3986, but something in the social practice of rfc standards as manifested in URL/URI-related things, has resulted in massive confusion and a situation described in OP where there is effectively no formal standard for URLs/URIs consistently used across software.

Part of the messiness of URLs might stem from the fact that they occur in a lot of different contexts, with very different properties and reserved characters:

- Somewhere within some unstructured text: Here, finding the exact end of the URL is difficult and disallowing spaces can have a great benefit. Even then you can have ambiguities: Does punctuation at the end belong to the URL? What about URLs that are wrapped in <> brackets?

Whether or not it's even possible for the URL to contain non-ASCII chars depends on the type of the document.

- As a value of an HTML attribute without surrounding quotes: Slightly less ambiguous, but including a space would break the HTML parser. The parser works fine with Unicode characters, however, so there is less pressure for authors to encode them.

- As a value of a quoted HTML or XML attribute, HTTP header or data field: Charset, start and end of the URL are communicated out-of-band, so there is no pressure at all to normalize the URL.

- In the browser address bar: Start and end are unambiguous but there are other concerns: you have to detect if it's an actual URL or a search term, you have to deal with omitted parts (e.g. no scheme), and people WILL input Unicode characters. You also want to display Unicode characters to the user, except when you don't want to for security reasons. And then users will copy the URL in the address bar and paste it into an HTML document...

I think we should discount The Widget Formerly Known As The Address Bar from these sorts of discussions. The omnibox[chrome]/awesomebar[ff]/frustrationmaker[edge]/etc. is a field where a human can type or paste all sorts of guff, and where the browser can present the human with a representation of the location of the cat video currently playing on the screen (complete with optional scheme, padlock icon, elided path elements, fancy colours and shading, Unicode snowmen, etc.)

Sometimes the guff from the human looks exactly the same as a URL, sometimes it's close enough to work out what they meant, and the rest of the time it's just guff. If it looks exactly like a URL or IRI, it's trivial for the browser to turn it into a proper URI and emit an appropriate request. If it's close, then some assumptions can be made, but the browser is still responsible for constructing the actual URI. That doesn't mean that what the human typed has any bearing on the underlying protocols or standards, or the definition of a URI/URL.

    (and yeah, file: URLs most have three but lets ignore that for now)
I don't want to be nitpicky, but a file URL actually still has just two slashes; the third slash is the start of the root path. In file:///home the first two slashes belong to the "file://" scheme prefix and the third is the beginning of the path (the host is simply omitted).

Actually there is one exception, but let's just ignore this widely 'unused' operating system, which doesn't seem to care about POSIX and RFCs anyway.

Actually, no. From RFC 1738, Section 3.10:

   A file URL takes the form:

       file://<host>/<path>

   where <host> is the fully qualified domain name of the system on
   which the <path> is accessible, and <path> is a hierarchical
   directory path of the form <directory>/<directory>/.../<name>.
And then in Section 5 (the BNF):

   ... elements may be preceded
   with <n>* to designate n or more repetitions of the following
   element; n defaults to 0.
   fileurl        = "file://" [ host | "localhost" ] "/" fpath
   fpath          = fsegment *[ "/" fsegment ]
   fsegment       = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
So that first slash isn't part of the path, and the path starts with a directory name. The fact that the root directory in a UNIXy environment doesn't have a name (or has a zero-length name) is a source of much confusion. (Nobody would accept "file:////etc/passwd" as a URL, but that's how I read the spec.)
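A WHATWG-style parser (Node's `URL` class here) agrees that in file:///home the host is empty and the path starts at the third slash:

```javascript
// file:///home/user: two slashes delimit the (empty) authority,
// and the third slash begins the absolute path.
const u = new URL('file:///home/user/file.txt');
console.log(u.host);     // → '' (empty host, i.e. localhost)
console.log(u.pathname); // → /home/user/file.txt
```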

This is also why Firefox insists that there are five consecutive slashes when encoding a UNC string (a Windows/Samba share path, like "\\server\share\path\to\file.txt") into a URL (e.g.: "file://///server/share/path/to/file.txt")

And let us not even begin on the fact that file: URLs are specified in an "obsolete" spec, and explicitly allow characters in the path that are disallowed by RFC 3986. I'm working on fixing that [https://tools.ietf.org/html/draft-ietf-appsawg-file-scheme] but it takes time.

Punycode, which is used to encode Unicode URLs, is also what provides support for Unicode characters in DNS.

The forking of the URL standard has been indeed annoying. It's a standards fail.

... now try handling BIDI IRI

> The Web Hypertext Application Technology Working Group

That name alone should tell you that they don't give a hoot about anything other than WWW and HTTP.
