
My url isn't your url - dosshell
https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/
======
kmike84
It seems nobody really understands how to handle non-ASCII URLs. To get it
right for Scrapy (i.e. do the same as browsers), we checked how several
browsers behave; the results were consistent but surprising, if not crazy.
Before sending a URL to a server (e.g. when the user clicks a link), a
browser does the following:

1. The URL is parsed into components.

2. The domain name is encoded using punycode.

3. The URL path is encoded to UTF-8 and then percent-escaped.

4. The query string and fragment are encoded using the _page encoding_ and
then percent-escaped. For URLs entered into the address bar it seems that
UTF-8 is used, not the page encoding.

5. The result is joined back into a single byte string and sent to the
server.

So this means parts of a single URL are encoded using three different
encodings, and one needs to know the encoding of the page a URL was extracted
from to send it correctly (i.e. to send it the way browsers do). I haven't
seen this algorithm stated explicitly anywhere, but this is how all browsers
work. It doesn't make any sense.
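
A minimal Python sketch of the algorithm above (the cp1251 page encoding is
just an illustrative assumption, and real browsers handle many more edge
cases such as ports, userinfo and missing components):

    from urllib.parse import urlsplit, quote

    def encode_like_a_browser(url, page_encoding="utf-8"):
        # 1. Parse the URL into components (assumes a host is present).
        parts = urlsplit(url)
        # 2. Encode the domain name using punycode (IDNA).
        host = parts.hostname.encode("idna").decode("ascii")
        # 3. Encode the path to UTF-8, then percent-escape it.
        path = quote(parts.path.encode("utf-8"), safe="/%")
        # 4. Encode query and fragment with the *page* encoding,
        #    then percent-escape them.
        query = quote(parts.query.encode(page_encoding), safe="=&%")
        frag = quote(parts.fragment.encode(page_encoding), safe="%")
        # 5. Join everything back into a single ASCII string.
        result = "%s://%s%s" % (parts.scheme, host, path)
        if query:
            result += "?" + query
        if frag:
            result += "#" + frag
        return result

    # e.g. a link extracted from a cp1251-encoded page:
    print(encode_like_a_browser("http://пример.рф/путь?q=значение", "cp1251"))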

~~~
oneeyedpigeon
> For URLs entered to address bar it seems that UTF-8 is used, not page
> encoding.

Maybe I'm misunderstanding, but how could a browser use page-encoding before
the page has been fetched?

~~~
wolfgke
That's how:
[https://html.spec.whatwg.org/multipage/syntax.html#determining-the-character-encoding](https://html.spec.whatwg.org/multipage/syntax.html#determining-the-character-encoding)

~~~
ingenter
And how would you deal with a URL to the HTML itself?

~~~
kmike84
There are three ways to get a URL:

a) The URL is entered into the address bar. A default encoding (it seems to
be UTF-8) is used for the query and fragment.

b) The user clicks on a link. The page URL is used; the page encoding is
already known.

c) A redirect to a new URL happens. I can't say off the top of my head what
happens in this case.

~~~
jrochkind1
This is just a recipe for massive confusion though.

You're saying if I click on a link with an IRI in a non-UTF-8 HTML page, it
means one thing.

But if I copy and paste that link into a text file, and later paste it into a
browser, it'll mean something else entirely? Or if I save it to a bookmark
using some service?

~~~
kmike84
Yeah, it is a recipe for confusion. Browsers use a workaround: when you copy
the URL of a link (using "Copy Link Address" in Chrome or "Copy Link
Location" in Firefox), it copies the escaped URL, not the original URL.

------
jgraham
This seems to fall into the common trap of misunderstanding the causality in
standardising legacy features of the web platform. The post says:

> The [WHATWG] spec says so because browsers have implemented the spec.

But that's exactly wrong. The WHATWG spec is based on the observed behaviour
of browsers. By documenting what existing browsers do, and scoping out who is
prepared to change their implementation when there are differences, it tries
to specify a behaviour that is either already implemented or able to be
implemented with as few changes as possible. It is _possible_ to specify
behaviour that is stricter than any current implementation, but the burden of
proof for doing so is non-trivial; you have to demonstrate, to the
satisfaction of all involved parties, that the proposed change will not break
enough web content to cause a compatibility problem. Doing that is non-
trivial, but possible if you are sufficiently motivated.

The complaint that the WHATWG spec is written by a closed group also seems
like a misunderstanding. Compared to pay-to-play organisations like the W3C, or
pay-to-meet organisations like IETF, there is no concept of formal membership,
or need to pay to get your voice heard. There's a mailing list, an open GitHub
repository and an IRC channel. Anyone, especially prospective implementors of
the spec, is welcome to participate in those forums, and they will have the
same privileges as any other contributor. I imagine that for URL in
particular, contributions from the curl community would be most welcome.

~~~
tfm
It is good to have clarity on how the WHATWG is working. Thank you! Sounds
like an HTML5-style treatment of the problem.

I'm not clear, though, why the effective starting point is the _superset_ of
all variants currently implemented by any of the involved parties. Given that
the extant standards are quite strict, couldn't an argument be made that the
"burden of proof" should be on why these need to be extended, on a case-by-
case basis?

Regardless of the particular agreements reached, an up-to-date standard will
certainly benefit everyone; it seems like it would be an excellent outcome if
the standard came with a reference implementation too :-)

~~~
majewsky
Browser vendors will be hesitant to remove any special variant that might
break existing websites.

~~~
greglindahl
The number of websites with this issue is so small that in my 9 years of
crawling the web, I'd never noticed the issue.

~~~
Drdrdrq
Why do you think you would notice it? I imagine the only giveaway would be
the URL in the status bar when you hover the mouse over a link. If you are a
programmer (or a very pedantic person) you might notice it, otherwise
probably not.

Also, I hope they acted on some real-world statistics, provided by Google's or
other crawlers...

~~~
greglindahl
I was the CTO of a web-scale search engine, and now I work on the Internet
Archive's Wayback Machine. I'm not referring to things I noticed as a web end-
user.

------
tfm
When the robustness principle meets the tragedy of the commons, good people
get driven to the brink of madness.

There are surely arguments for allowing any old garbage in URLs/URIs/IRIs -
hey, it's easy and fun! - but, gee, it's not like bunging an identifier
through a URI escape function is going to triple the code base.
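
For instance, a minimal sketch in Python (the path here is made up):

    from urllib.parse import quote

    # Percent-escape everything outside the RFC 3986 unreserved set,
    # keeping "/" as the path separator. One line, not a tripled code base.
    print(quote("/files/résumé v2.pdf", safe="/"))
    # -> /files/r%C3%A9sum%C3%A9%20v2.pdf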

Compare that with the surface area for potential bugs in the parsing code
which somehow didn't account for several megabytes of whitespace embedded in a
URL or multiple code pages in a single string or whatever zalgoesque horror is
showing up on the security bulletins this week. It's a lot easier to proof
code that only needs to deal with a strict subset of 7-bit clean ASCII and can
politely decline embedded emojis.

When one web client starts being overly zealous in what it accepts, that puts
an implicit onus on everybody else to start accepting that too ("all browsers
do"!). Where would you draw the line? Well, we've got a couple of RFCs lying
around, how about we go with those.

This stuff doesn't have to be hard. Surely the act of issuing an HTTP redirect
isn't the Last Great Unsolved Problem of web engineering! I say rejoice in the
beauty of URIs in canonical form. Be miserly in what you accept for a change.

~~~
tuukkah
Most of the world is outside the (English-speaking) United States of America
and doesn't speak an ASCII language. That's why Unicode exists - couldn't we
all just use it? That's what the IRI RFC says, in essence. EDIT: Perhaps I
misread and you're saying we need two layers, like the RFCs intended: the
user-visible Unicode IRIs and the protocol-level ASCII URIs. cURL lives
between these two and arguably would be simpler if there were no such
separation and everything were Unicode to begin with.

~~~
tfm
Agreed. I love me all the Unicodes, all the time! For typing, for display, for
whatever internal app logic, variable names, the names of children etc.

The IRI RFC does however specify that for transmission/interchange the IRIs
need to be percent-encoded to form valid URIs. That's pretty straightforward!
A good user agent will fully transcode whatever characters I type in the
address bar, generating a valid IRI. A bad user agent will complain and say
that whatever it was I typed was very nice but not a valid interweb, and I
have a bad day. A broken user agent may generate some broken encoding (or just
pass on whatever I typed in UTF-8 or CP-1252), and some web admins may have a
bad day, if they weren't already.
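
For reference, a rough sketch of that RFC 3987 mapping step in Python
(percent-encoding only; a complete converter would also apply IDNA to the
host component):

    def iri_to_uri(iri):
        # RFC 3987 section 3.1: encode each non-ASCII character as
        # UTF-8 and percent-escape the resulting bytes.
        return "".join(
            ch if ord(ch) < 128
            else "".join("%%%02X" % b for b in ch.encode("utf-8"))
            for ch in iri
        )

    print(iri_to_uri("http://example.org/résumé"))
    # -> http://example.org/r%C3%A9sum%C3%A9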

But the location bar is not the only place that URLs come from (less and less
every day). When a web application (e.g. 301 redirect) or a resource (e.g.
external document reference) uses an RFC-3987 noncompliant identifier, just
how far should the user-agent bend over backwards to fetch the resource? Some
browsers seem to do a lot, but it's not clearly documented just how much (see
elsewhere in the comments), and among the article's laments was the
observation that when any browser accepts dodgy identifiers, it places a
burden on everyone else to accept the same.

Seems that the WHATWG hopes to formally codify just how dodgy you can get
without an identifier being "too dodgy", but I get the impression that this
will be an expansive definition ("hey, if they did it, you can do it! we
believe in you"). That's an understandable approach, as they are trying to
document the state of the web, but it does put an extra burden onto anyone
dealing methodically with URLs henceforth, rather than being strict and
drawing up some clear boundaries for people constructing URLs. I'm expecting
to see a lot of "SHOULD" and not a lot of "MUST NOT" :-/

------
sleepychu
Page seems to be struggling.

[http://webcache.googleusercontent.com/search?q=cache:https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/&num=1&strip=1&vwsrc=0](http://webcache.googleusercontent.com/search?q=cache:https://daniel.haxx.se/blog/2016/05/11/my-url-isnt-your-url/&num=1&strip=1&vwsrc=0)

~~~
oxguy3
Mildly amused that the website of the guy who makes cURL wasn't ready for the
HN hug of death.

------
cauterized
It seems to me that much of the article can be boiled down to the fact that
the standards are being set by people writing software that

a) has to be usable by users with the lowest common denominator of tech savvy
and

b) doesn't have to parse URLs out of other text because it encounters URLs
only in defined fields

As a result, this software is written to be lenient about trying to interpret
URLs input by the user. So it already has to have facilities to transform URLs
with improperly encoded special characters such as spaces. This allows it to
in turn be more lenient about what it accepts from servers.

With no clear standard for these transformations and the browser makers not
sharing their parsing and transformation libraries, the writers of other
software that has to deal with URLs are left playing catch-up and trying to
reverse-engineer the browsers' undocumented hacks for dealing with not-very-
well-formed URLs.

Moreover, many have the additional burden of having to be able to parse URLs
out of other text where the delineation of URL start/end may not be clear.
Some of the forms supported by browser makers (who don't have to do this)
can't reasonably be supported in this context, leading to frustration for
developers (let alone users).

------
jacksonsabey
There's already a solution in the spec for the slash issue.

Either you have a valid scheme or you don't, and either you have a relative
path or an absolute path.

A double slash after a valid scheme implies an authority; if a third slash
exists it's an absolute path, and anything past the third slash has to be
normalized. So http:////// is actually http:///

If you have an invalid scheme or a relative path, such as http:/ or
+http://, it would actually be normalized to ./http:/ or ./+http:/

[https://tools.ietf.org/html/rfc3986#section-4.2](https://tools.ietf.org/html/rfc3986#section-4.2)
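
A hedged sketch of that scheme test (the grammar is from RFC 3986 section
3.1; anything that fails it is a relative reference):

    import re

    # scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )
    SCHEME = re.compile(r"^([A-Za-z][A-Za-z0-9+.-]*):")

    for ref in ("http://example.org/", "+http://example.org/", "http:/x"):
        m = SCHEME.match(ref)
        print(ref, "->",
              ("scheme %r" % m.group(1)) if m else "relative reference")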

------
tuukkah
_curl gets confused by non-ASCII letters in the path part but percent encodes
such byte values in the outgoing requests – which causes “interesting” side-
effects when the non-ASCII characters are provided in other encodings than
UTF-8 which for example is standard on Windows…_

I don't think cURL implements this percent-encoding yet - instead, it sends
out the path bytes as-is (e.g. on Linux with a UTF-8 locale). This was
recently added to cURL's TODO, though:
[https://curl.haxx.se/docs/todo.html#Add_support_for_IRIs](https://curl.haxx.se/docs/todo.html#Add_support_for_IRIs)

------
frobozz
FTA:

> The term URL was later changed to become URI, Uniform Resource Identifiers

rfc3986:

> A URI can be further classified as a locator, a name, or both. The term
> "Uniform Resource Locator" (URL) refers to the subset of URIs that, in
> addition to identifying a resource, provide a means of locating the resource
> by describing its primary access mechanism

Personally, I don't treat errors like that as massively damaging to the
credibility of the rest of the article, but it is rather jarring, and I know
that people do post "Stopped reading at..." comments about this sort of
thing.

~~~
jrochkind1
Eh, I think your comment only confirms the main point of the article, that
it's become massively confusing.

I was around back then, and definitely recall the conventional wisdom being
"don't call them URLs anymore, call them URIs, that's the new standard." At
any rate, RFC 3986 claimed to be a new standard for _syntax_ -- according to
the passage you quote, whether it's a "URL" or a non-URL URI is not a matter
of syntax at all. The syntax spec is now called "URI", or so that RFC claims.

Except somehow the result has been massive confusion over which standard
should be used for the syntax of URLs/URIs, with the WHATWG just stirring the
pot further (obligatory reference to the xkcd about "too many standards,
let's create another"). I don't know that there's anything wrong or to blame
in the written text of RFC 3986, but something in the social practice of RFC
standards, as manifested in URL/URI-related things, has resulted in massive
confusion and the situation described in the OP, where there is effectively
no formal standard for URLs/URIs consistently used across software.

------
xg15
Part of the messiness of URLs might stem from the fact that they occur in a
lot of different contexts, with very different properties and reserved
characters:

- Somewhere within some unstructured text: here, finding the exact end of the
URL is difficult, and disallowing spaces can have a great benefit. Even then
you can have ambiguities (see the sketch after this list): does punctuation
at the end belong to the URL? What about URLs that are wrapped in <> brackets?

Whether or not it's even possible that the URL contains non-ASCII chars
depends on the type of the document.

- As the value of an HTML attribute without surrounding quotes: slightly less
ambiguous, but including a space would break the HTML parser. The parser
works fine with Unicode characters, however, so there is less pressure for
authors to encode them.

- As the value of a quoted HTML or XML attribute, HTTP header or data field:
the charset and the start and end of the URL are communicated out-of-band, so
there is no pressure at all to normalize the URL.

- In the browser address bar: start and end are unambiguous, but there are
other concerns: you have to detect whether it's an actual URL or a search
term, you have to deal with omitted parts (e.g. no scheme), and people WILL
input Unicode characters. You also want to display Unicode characters to the
user, except when you don't want to for security reasons. And then users will
copy the URL in the address bar and paste it into an HTML document...
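
To illustrate the first point, a naive extraction sketch (the regex and the
trailing-punctuation fix-up are ad-hoc assumptions, not anyone's standard):

    import re

    text = "Read http://example.org/a. See also (http://example.org/x_(y))."

    for m in re.finditer(r"https?://\S+", text):
        url = m.group()
        # Stripping trailing punctuation rescues the first URL but
        # mangles the second, which legitimately ends in ")".
        print(url, "->", url.rstrip(".,)"))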

~~~
phluid61
I think we should discount _The Widget Formerly Known As The Address Bar_ from
these sorts of discussions. The
omnibox[chrome]/awesomebar[ff]/frustrationmaker[edge]/etc. is a field where a
human can type or paste all sorts of guff, and where the browser can present
the human with a representation of the location of the cat video currently
playing on the screen (complete with optional scheme, padlock icon, elided
path elements, fancy colours and shading, Unicode snowmen, etc.)

Sometimes the guff from the human looks exactly the same as a URL, sometimes
it's close enough to work out what they meant, and the rest of the time it's
just guff. If it looks exactly like a URL or IRI, it's trivial for the browser
to turn it into a proper URI and emit an appropriate request. If it's close,
then some assumptions can be made, but the browser is still responsible for
constructing the actual URI. That doesn't mean that what the human typed has
any bearing on the underlying protocols or standards, or the definition of a
URI/URL.

------
merb

        (and yeah, file: URLs most have three but lets ignore that for now)
    

I don't want to be nitpicky, but a file URL actually still has two slashes;
the third slash is the start of the root path. I.e. in file:///home the two
slashes belong to the "://" separator and the third is the start of the path.
(Actually it just omits the host.)

There is one exception, but let's just ignore this widely 'unused' operating
system, which doesn't seem to care about POSIX and RFCs anyway.

~~~
phluid61
Actually, no. From RFC 1738, Section 3.10:

    
    
       A file URL takes the form:
       
           file://<host>/<path>
       
       where <host> is the fully qualified domain name of the system on
       which the <path> is accessible, and <path> is a hierarchical
       directory path of the form <directory>/<directory>/.../<name>.
    

And then in Section 5 (the BNF):

    
    
       ... elements may be preceded
       with <n>* to designate n or more repetitions of the following
       element; n defaults to 0.
       
       fileurl        = "file://" [ host | "localhost" ] "/" fpath
       
       fpath          = fsegment *[ "/" fsegment ]
       fsegment       = *[ uchar | "?" | ":" | "@" | "&" | "=" ]
    

So that first slash isn't part of the path, and the path starts with a
directory name. The fact that the root directory in a UNIXy environment
doesn't have a name (or has a zero-length name) is a source of much confusion.
(Nobody would accept "file:////etc/passwd" as a URL, but that's how I read the
spec.)

This is also why Firefox insists that there are five consecutive slashes when
encoding a UNC string (a Windows/Samba share path, like
"\\server\share\path\to\file.txt") into a URL (e.g.
"file://///server/share/path/to/file.txt").

And let us not even begin on the fact that file: URLs are specified in an
"obsolete" spec, and explicitly allow characters in the path that are
disallowed by RFC 3986. I'm working on fixing that
[[https://tools.ietf.org/html/draft-ietf-appsawg-file-scheme](https://tools.ietf.org/html/draft-ietf-appsawg-file-scheme)]
but it takes time.

------
dbalan
Punycode, which is used to encode Unicode URLs, is also what provides support
for Unicode characters in DNS.
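
A quick illustration with Python's built-in codecs (raw punycode lacks the
xn-- prefix that the full IDNA encoding adds to each label):

    # Raw punycode of a single label vs. the IDNA form used in DNS:
    print("пример".encode("punycode"))  # b'e1afmkfd'
    print("пример.рф".encode("idna"))   # b'xn--e1afmkfd.xn--p1ai'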

------
jrochkind1
The forking of the URL standard has indeed been annoying. It's a standards
fail.

------
slim
... now try handling BIDI IRI

------
lolidaisuki
> The Web Hypertext Application Technology Working Group

That name alone should tell you that they don't give a hoot about anything
other than WWW and HTTP.

