
The History of the URL: Path, Fragment, Query, and Auth - zackbloom
https://eager.io/blog/the-history-of-the-url-path-fragment-query-auth/?h
======
WiseWeasel
_> Given the power of search engines, it’s possible the best URN format today
would be a simple way for files to point to their former URLs. We could allow
the search engines to index this information, and link us as appropriate:_

    
    
      <!-- On http://zack.is/history -->
      <link rel="past-url" href="http://zackbloom.com/history.html">
      <link rel="past-url" href="http://zack.is/history.html">
    

This seems like a heart-warmingly naive proposal; it's an open invitation to
cross-domain search result hijacking.

~~~
zackbloom
Really good point. It would either only make sense on the same origin, or if
you authenticated the domain.

~~~
seanp2k2
How? Who do you trust for the authentication? See SSL certificates, domain
name registration, and mail servers for some ideas around how this plays out.

It's not that what we have now is perfect, but the use-case of typing in
example.com or googling _example_ is indeed a low enough bar that non-
technical users get it, and re-tooling everything to support an alternate
format is a non-starter. See also: IPv6 adoption, and not many people are
typing in or googling IPs.

~~~
zackbloom
In terms of authentication I think being able to add a (very specific) piece
of content to the site or being able to add a DNS record are two common
solutions. I'm not suggesting it's viable in this case though.
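For the curious, those usually look something like the sketch below; the record
name "past-url-allowed" and the exact syntax are made up here, purely to
illustrate the two approaches (the old domain authorizing the new one).

    
    
        ; hypothetical DNS record published on the old domain, authorizing the new one
        zackbloom.com.  3600  IN  TXT  "past-url-allowed=zack.is"
    
        <!-- or a hypothetical tag served from a page on the old domain -->
        <meta name="past-url-allowed" content="zack.is">
    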

------
orf
Perhaps the author has a point, but I don't think it's as bad as he puts it.
Grandma and Grandad don't need to learn UNIX paths to use the net, they just
need to know one thing: the hostname. They type in "google.com" or
"walmart.com" and that's it, not hard to grok. Perhaps they don't fully get
what .com means but they know it's needed and some sites have different ones.

You never see an advert with "visit our site at
http://user:pass@mycompany.com/app/page1244.html?abc=def#hello",
it's just "mycompany.com". The structure of the URLs once the user has hit
the landing page and clicked a couple of links doesn't need to be known by the
user, nor should the user care.

To put it simply: if it was hard then the web wouldn't be as big as it is now.
Non-technical people understand simple URIs and that's all that's needed.

~~~
stonogo
That's not entirely true:

[http://cdn.doctorswithoutborders.org/sites/usa/files/tpp_ad_for_site_0.png](http://cdn.doctorswithoutborders.org/sites/usa/files/tpp_ad_for_site_0.png)
[http://i.imgur.com/HWNM1KZ.jpg](http://i.imgur.com/HWNM1KZ.jpg)
[http://www.shofforddesign.com/wp-content/uploads/Oakley-Magazine.jpg](http://www.shofforddesign.com/wp-content/uploads/Oakley-Magazine.jpg)

This sort of thing is useful for advertising specifically. In the examples
I've given:

1) targeted information 2) short url 3) direct-to-relevant-site from the ad

In my experience, in-house or business-wide advertising tends to use just the
domain, but campaigns farmed out to marketing firms or campaigns related to
specific products tend to use path info as well as domain name.

And while I've never seen auth credentials in an advertised URL, I have
_frequently_ seen "example.com/product#platform" style URLs, where #platform
(and sometimes ?platform) indicates which magazine the ad appeared in.

~~~
wtbob
> I have frequently seen "example.com/product#platform" style urls, where
> #platform (and sometimes ?platform) indicates which magazine the ad appeared
> in.

Since the fragment doesn't get sent to the server, that seems broken …

~~~
soared
You can grab anchors with Google Analytics.
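For context: the fragment never leaves the browser in the HTTP request, but it
is visible to any script running on the page, so client-side analytics code can
read it and report it. A minimal sketch of the idea; the reporting endpoint
below is made up:

    
    
        // the fragment stays client-side, but scripts can read it
        const source: string = window.location.hash.slice(1); // "platform" for "#platform"
        if (source) {
          // report it to whatever analytics endpoint is in use (this URL is hypothetical)
          navigator.sendBeacon("/collect?ad_source=" + encodeURIComponent(source));
        }
    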

~~~
wtbob
Ouch, that's horrible — I'd no idea. Yet another reason to disable JavaScript
by default.

------
myfonj
Great article indeed. What I personally missed there was a clear statement
about segment (or path) parameters as opposed to query ("search") parameters:

    
    
        //domain.tld/segment;param1=value1,param2/segment?query=..
    

Clearly nowadays it seems a bit odd. I bumped into it while reading about this
in "HTTP: The Definitive Guide" from 2002, in the chapter "2.2 URL Syntax". I
actually read this just a few years ago, and ever since I have wondered whether
this aspect was really adopted by any wild CGI back in the days I didn't
experience myself.

The Definitive Guide apparently paraphrased RFC 2396, which clearly defined the
semicolon as the segment parameter delimiter, but that RFC was later obsoleted
[a] by RFC 3986, which demotes the former standard to a "possible practice"
[b], stating that:

> URI producing applications often use the reserved characters allowed in a
> segment to delimit scheme-specific or dereference-handler-specific
> subcomponents. For example, the semicolon (";") and equals ("=") reserved
> characters are often used to delimit parameters and parameter values
> applicable to that segment. The comma (",") reserved character is often used
> for similar purposes. For example, one URI producer might use a segment such
> as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas
> another might use a segment such as "name,1.1" to indicate the same.
> Parameter types may be defined by scheme-specific semantics, but in most
> cases the syntax of a parameter is specific to the implementation of the
> URI's dereferencing algorithm. [c]

[a] wouldn't have noticed without:
[http://stackoverflow.com/questions/6444492/can-any-path-segments-of-a-uri-have-a-query-component](http://stackoverflow.com/questions/6444492/can-any-path-segments-of-a-uri-have-a-query-component)

[b] [https://tools.ietf.org/html/rfc3986#section-3.3](https://tools.ietf.org/html/rfc3986#section-3.3)

~~~
kiwidrew
Yes, path segment parameters are a rather interesting (but mostly historical)
curiosity.

Recently I have begun using them in a few of my internal APIs, where it is
useful to have a distinction between "parameters used to filter a set of
items" and "parameters used to request items in a particular format"... an
example:

    
    
        GET /v1/tickets;status=closed;milestone=100?fields=id,title,reporter&sort=-last_updated
    

Pro: this nicely addresses the problem of collisions between the names of data
fields (such as "title") and the names of response-modifying parameters (such
as "sort").

Con: consumers of your API get to discover the joys of RFC 3986, because there
is essentially zero support for path segment parameters in modern HTTP
libraries.
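If you do go down this road, the parsing usually ends up hand-rolled, since
mainstream URL parsers just leave the ';' inside the path. A rough TypeScript
sketch of one way to do it, assuming the ';' and '=' convention from RFC 3986
(the host name below is made up):

    
    
        // split each path segment into its name and its ";"-delimited parameters
        function parseSegment(segment: string) {
          const [rawName, ...rawParams] = segment.split(";");
          const params: Record<string, string> = {};
          for (const p of rawParams) {
            const [k, v = ""] = p.split("=");
            params[decodeURIComponent(k)] = decodeURIComponent(v);
          }
          return { name: decodeURIComponent(rawName), params };
        }
    
        const url = new URL("https://api.example.com/v1/tickets;status=closed;milestone=100?fields=id,title");
        const segments = url.pathname.split("/").filter(Boolean).map(parseSegment);
        // segments[1] -> { name: "tickets", params: { status: "closed", milestone: "100" } }
    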

------
scrollaway
Part 1: [https://eager.io/blog/the-history-of-the-url-domain-and-protocol/](https://eager.io/blog/the-history-of-the-url-domain-and-protocol/)

Also previously on HN: [https://eager.io/blog/the-languages-which-almost-were-css/](https://eager.io/blog/the-languages-which-almost-were-css/)

Both excellent reads.

------
bduerst
Just a random URL fact: the TLDs for domains could technically be used as
domains as well, if ICANN ever allowed it.

It wouldn't make sense for TLDs like com, net, org, etc., but for trademarked
TLDs like barclays, youtube, pwc, etc. visitors could essentially go straight
to that webpage with the TLD, like [https://youtube](https://youtube)

~~~
theandrewbailey
Generic TLDs have mostly solved that.

~~~
digi_owl
If you are thinking about .xyz and similar, you still have to have another
domain below those. You can't simply register xyz and have it point directly
at a server.

~~~
Symbiote
Sure you can: [http://ai./](http://ai./) (there are other CCTLDs doing this).

Only one new TLD does this, and the page is empty:
[http://мон./](http://мон./) or [http://xn--l1acc./](http://xn--l1acc./)

~~~
colejohnson66
And now that Google owns the `google` TLD (from their whole com.google thing a
while back), what's to stop them from setting up
[https://google./](https://google./)? Is there any technical reason? Do you
have to go through ICANN?

~~~
djsumdog
I think you're mistaken. There isn't a TLD for google. Under some of the ICANN
proposals, that would have been possible, but to date I don't think it's been
done.

~~~
colejohnson66
No, there is. [https://com.google./](https://com.google./) redirects to
[https://google.com/](https://google.com/)

[http://www.theregister.co.uk/2014/11/26/google_turns_on_google_internet_extension/](http://www.theregister.co.uk/2014/11/26/google_turns_on_google_internet_extension/)

[https://icannwiki.com/.google](https://icannwiki.com/.google)

~~~
scrollaway
They officially host their domain registration website on domains.google, too:

[https://domains.google/](https://domains.google/)

------
userbinator
IMHO the arguments given in the quotes in the article are not very compelling,
in view of the fact that humans have been identifying physical locations with
far more complex and inconsistent systems for literally _centuries_:

[https://news.ycombinator.com/item?id=8907301](https://news.ycombinator.com/item?id=8907301)

Compared to "real-life addresses", URLs are an absolute pleasure to handle and
understand; which naturally raises the question of why so many people seem to
have trouble, or are suggesting that others do, with URLs? It's just a
"virtual" address, in an easily-parseable format for identifying a location in
"cyberspace". Perhaps its the "every time you put an intelligent person in
front of a computer, his/her brain suddenly disappears" syndrome (for lack of
a better word)?

The advocacy of search engines instead of URLs is also not such a great idea;
sadly, search engines like Google today do not work like a 'grep' that lets
you find _exactly_ what you're looking for, do not index every page (or allow
you to see every result, which is somewhat equivalent), and the results are
also strongly dependent upon some proprietary ranking algorithm which others
have little to no control over. If relying on links and having them disappear
is bad, relying on SERPs is even worse since they are _far_ more dynamic and
dependent on many more external factors which may even include things like
_which country your IP says you're from when you search_.

Search engines are definitely useful for finding things, but as someone who
has a collection of links gathered over many years, most of which are still
alive and yet whose existence Google does not acknowledge _even when the URL
is searched_, I am extremely averse to search engines as any sort of
replacement for URLs.

~~~
zackbloom
The simple answer is that humans make errors. Even when mailing a letter, the
post office often applies corrections to written addresses which contain
mistakes. This is only possible because the address system is relatively
sparse and over-constrained.

For example:

    
    
        Bob Smith
        162 Portsmouth St
        Denver, CO 42348
    

This encodes the city in both the ZIP code and the city/state, allowing either
to be wrong. If 162 doesn't exist on that street, it's possible they can
correct it based on the name. If it's Portsmouth Ave, not St, they can likely
figure that out.

None of that flexibility exists in URLs. A single mistyped character or
misspelling and you are going to the wrong place with no way of getting where
you want (without search engines).

It's a system which works very well for machines, and only passably well for
error-prone humans.

~~~
x1798DE
This is a somewhat compelling reason why people don't _use_ URLs like they do
addresses (that plus the fact that physical locations are "discoverable", so
if you know something is in the general vicinity of 40th street you can
probably wander around a bit to find it).

But I don't know if it explains why people don't _understand_ URLs. Physical
location addresses are no more complicated than URLs (with the exception of
URL-encoded blobs in URLs, which are not usually human-readable). That said, I
don't usually talk about this stuff with non-tech types, so maybe the average
person does understand URLs more or less.

------
shmerl
_> rendering any advertisement which still contains them truly ridiculous_

I don't find it ridiculous: even though Gopher has gone out of existence and
FTP is a minority, the http vs. https distinction is still quite important
today, especially considering that a redirect from http to https can be
insecure and the proper way to open many sites is explicitly with
[https://](https://)

It might get fixed in the future, but it hasn't happened yet.

~~~
zackbloom
We are part of the way to a solution to that:
[https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security](https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security)
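For reference, HSTS is just a response header sent over HTTPS which tells the
browser to refuse plain HTTP for that host for a given period, e.g.:

    
    
        Strict-Transport-Security: max-age=31536000; includeSubDomains
    

It only helps after the first secure visit, unless the domain is on the
browsers' preload list.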

That said, I do think it's unreasonable to think an average Internet user will
discern that you specified https. I think a better practice is to try to use
https on every site, and only fall back if you decide you are willing to accept
the lack of security.

------
anyfoo
_Tim Berners-Lee thought we should have a way of defining strongly-typed
queries: <ISINDEX TYPE="iana:/www/classes/query/personalinfo"> I can be
somewhat confident in saying, in retrospect, I am glad the more generic
solution won out._

The author does not explain why he is glad that the more generic solution won
out. Having strongly-typed queries might have brought us much closer to some
approximation of a practical "semantic web" and done some wonders for web
services, accessibility, and other areas.

Maybe he is glad because not having any strong typing allowed us to have the
flexible, completely free-form web interfaces we have today, but who's to say
that wouldn't have emerged anyway, maybe even in a slightly saner form than the
horrible mess we have now.

On a completely unrelated note:

 _Given the power of search engines, it’s possible the best URN format today
would be a simple way for files to point to their former URLs._

This raises immediate concerns about security and spam, but that may be
solvable somehow.

That being said, I really enjoyed reading this thoroughly researched history a
lot, even more than the previous, also great installment about CSS's history.
(But my preference is just because I'm not a Web/Design guy.)

~~~
zackbloom
Sorry for not clarifying that. It's possible I'm only glad because it has
turned out that forms are much more important than search alone. I also think
it's roughly a good thing that we were ultimately given relatively basic
tools, and allowed to build all sorts of things with them. I do agree that
many unfortunate decisions were made in those early days, many of them because
they were just not aware of what problems we would be using the Internet to
solve in 2016.

------
nommm-nommm
If anyone is wondering (like I was) what the referenced ISBN is, it's The C
Programming Language by Brian Kernighan and Dennis Ritchie.

------
javajosh
As an aside, it would be interesting to adopt a convention where certain links
included a hash of what they point to, avoiding the case where the
(sub-)resource changes out from underneath the link. This implies that you'd
want to link to a specific version of a (sub)resource. E.g. you could do
something like what GitHub does with:

[https://github.com/USER/PROJECT/commit/COMMIT_HASH#DIFF_HASH](https://github.com/USER/PROJECT/commit/COMMIT_HASH#DIFF_HASH)

where the DIFF_HASH is a fragment pointing to a particular resource within the
commit.

~~~
zackbloom
There is a spec called Subresource Integrity which does something similar for
resources included inside HTML files.
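Roughly, it puts the expected hash in an attribute on the embedding tag rather
than in the URL, and the browser refuses to use the resource if it doesn't
match; something like this (the hash value below is just a placeholder):

    
    
        <script src="https://example.com/library.js"
                integrity="sha384-BASE64_HASH_OF_THE_EXPECTED_FILE"
                crossorigin="anonymous"></script>
    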

Doing it for links is problematic though. What do you do if the hash doesn't
match? Show an error? Show the destination page anyway? What happens if
the author _wants_ to change the content? Should links be immutable?

~~~
javajosh
Cool, that's almost exactly the idea[1]. Although I would argue that resources
should be immutable and versioned. Actually jsbin has a good representation
there where you have:

    
    
        http://domain/resource/version
    

and `http://domain/resource` just points to either
the latest version or to a specific one that the author chooses.

The big difference between SRI and my proposal is that the hash goes in the URL
rather than in an attribute.

[1] [https://www.w3.org/TR/SRI/](https://www.w3.org/TR/SRI/)

------
mkoryak
> As early as 1996 browsers were already inserting the [http://](http://) and
> www. for users automatically.

but not the default install of IE9. Figures.

~~~
ufdhigdfh
> but not the default install of IE9. Figures

To be fair, the author said browsers, and not incompatible mock-browsers like
IE.

Browsers:IE::cheese:"cheese food"

------
agentgt
In the Java world, URL/URI/URN construction, escaping, unescaping, and
manipulation is a confusing, disparate, buggy mess.

Of course there is the obviously broken `java.net.URL`, but there are so many
other libraries and coding practices where programmers just continuously screw
up URL/URI/URNs over and over and over. It is like XML/HTML escaping, but in my
experience far more rampant (thankfully most templating languages now escape
automatically).

In large part I believe this is because of the confusion around form encoding,
and because the URI specs came later and supersede the URL specs (but are not
actually entirely compatible).
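To make the form-encoding confusion concrete (a quick sketch in TypeScript,
though the same trap exists in Java, where the misnamed URLEncoder produces
form encoding): percent-encoding and application/x-www-form-urlencoded disagree
about something as basic as a space, so picking the wrong one quietly corrupts
URLs.

    
    
        const q = "C programming & URLs";
    
        // percent-encoding, what you want in a path segment or fragment
        encodeURIComponent(q);                  // "C%20programming%20%26%20URLs"
    
        // application/x-www-form-urlencoded, what HTML forms produce:
        // spaces become "+" instead of "%20"
        new URLSearchParams({ q }).toString();  // "q=C+programming+%26+URLs"
    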

In our code base alone we use something like 8 different URL/URI libraries:
HTTP Components/Client (both 3 and 4), Spring's own stuff, JAX-RS aka Jersey's
own stuff, the built-in, crappy, misnamed URLEncoder, the not-that-flexible
java.net.URI, and several others that I can't recall. I'm surprised Guava
hasn't joined the game as well.

I would love a small decoupled URL/URI/URN library that does shit easily and
correctly. URL templates would be nice as well. I have contemplated writing it
several times.

------
utopcell
This is a well-written article, but I find this criticism of URLs a bit
exaggerated. We needed to reach an agreement on URL structure and there is
little point in changing that now. Hardly anyone would abandon English for the
unambiguous Lojban
([https://en.wikipedia.org/wiki/Lojban](https://en.wikipedia.org/wiki/Lojban))
for example.

As for non-expiring identifiers to pages/content, the 'expired' URLs are as
good identifiers as anything, considering URL redirects exist.

I did find this tidbit of information quite intriguing: the creator of the
hierarchical file system attributes his creation to a two hour conversation he
had with Albert Einstein in 1952!

~~~
zackbloom
I'm sorry you read it as criticism, it actually wasn't my intent to criticize
our current URL system. I think it's a valuable system which has enabled the
Internet as we know it.

------
RockyMcNuts
I always suspected there must have been 2 archaic null tokens between the ://
of http://...

If it was a design choice to use a 3-character separator instead of a single
character, it seems an odd choice.

~~~
DougBTX
[https://www.ietf.org/rfc/rfc1738.txt](https://www.ietf.org/rfc/rfc1738.txt)
may shed some light on this. It describes the ":" as the separator, while the
// is a scheme-specific prefix, like a super-powered path separator.

More here: [https://www.w3.org/People/Berners-Lee/FAQ.html#etc](https://www.w3.org/People/Berners-Lee/FAQ.html#etc)

------
kevin_thibedeau
> This system made it possible to refer to different systems from within
> Hypertext, but now that virtually all content is hosted over HTTP, may not
> be as necessary anymore.

WS and WSS will become more and more commonplace over time. I like that a 25+
year old protocol is forward compatible enough to accommodate new methods of
network communication. It's debatable whether HTTPS and WSS are necessary as
URL schemes, but for those who care about such things they give a hard
guarantee that a secure connection will be made and not silently downgraded.

~~~
zackbloom
I think we're getting closer to the point where browsers will explicitly flag
HTTP as insecure.

------
Azy8BsKXVko
I like the idea of URNs, mostly because _I just think they're cool_. They'd
be pretty infeasible though.

~~~
MrRadar
If you think URNs are cool, give IPFS a look. It's like content-addressing
taken to the next level.

[https://ipfs.io/](https://ipfs.io/)

------
microDude
I am not a web developer, but I have to ask.

Using basic authentication over SSL, does that mean if you entered
[https://user:pass@domain](https://user:pass@domain) that the user and pass
would be sent in the clear, or does this get put into the header and
encrypted?

~~~
csbowe
It's base64 encoded and put into the header, according to the article.
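Concretely, the browser just joins the two with a colon and base64-encodes the
result into an Authorization header; base64 is an encoding, not encryption, so
it's the TLS layer that keeps it confidential on the wire. A quick sketch:

    
    
        // what the browser sends for https://user:pass@domain
        const header = "Authorization: Basic " + btoa("user:pass");
        // -> "Authorization: Basic dXNlcjpwYXNz"
    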

~~~
colejohnson66
But if you use HTTPS, those headers are encrypted, right?

~~~
kevin_thibedeau
The problem is they can end up in the logs of the receiving server and if that
gets hacked...

------
tantalor
> The same thinking should lead us to a superior way of locating specific
> sites on the Web.

Try [http://google.com](http://google.com)

