The History of the URL: Path, Fragment, Query, and Auth (eager.io)
333 points by zackbloom on July 18, 2016 | 71 comments



>Given the power of search engines, it’s possible the best URN format today would be a simple way for files to point to their former URLs. We could allow the search engines to index this information, and link us as appropriate:

  <!-- On http://zack.is/history -->
  <link rel="past-url" href="http://zackbloom.com/history.html">
  <link rel="past-url" href="http://zack.is/history.html">
This seems like a heart-warmingly naive proposal; cross-domain search result hijacking.


Really good point. It would either only make sense on the same origin, or if you authenticated the domain.
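
A minimal sketch of the same-origin restriction, in Python. The `past-url` relation is only the article's proposal, not a registered link type, and `PastUrlParser`, `origin`, and `same_origin_past_urls` are names made up here. It assumes "same origin" means matching scheme, host, and port.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class PastUrlParser(HTMLParser):
    """Collects href values from <link rel="past-url"> tags (hypothetical rel)."""

    def __init__(self):
        super().__init__()
        self.past_urls = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "link" and attr.get("rel") == "past-url" and attr.get("href"):
            self.past_urls.append(attr["href"])


def origin(url):
    """Return (scheme, host, port), filling in the default port when omitted."""
    parts = urlsplit(url)
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    return (parts.scheme, parts.hostname, parts.port or default_port)


def same_origin_past_urls(page_url, html):
    """Keep only the declared past URLs that share the page's origin."""
    parser = PastUrlParser()
    parser.feed(html)
    return [u for u in parser.past_urls if origin(u) == origin(page_url)]


html = """
<link rel="past-url" href="http://zackbloom.com/history.html">
<link rel="past-url" href="http://zack.is/history.html">
"""
accepted = same_origin_past_urls("http://zack.is/history", html)
print(accepted)
```

With the two links from the article, only the zack.is one survives the filter; the cross-domain claim is dropped.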


How? Who do you trust for the authentication? See SSL certificates, domain name registration, and mail servers for some ideas around how this plays out

It's not that what we have now is perfect, but the use-case of typing in example.com or googling example is indeed a low enough bar that non-technical users get it, and re-tooling everything to support an alternate format is a non-starter. See also: IPv6 adoption, and not many people are typing in or googling IPs.


In terms of authentication I think being able to add a (very specific) piece of content to the site or being able to add a DNS record are two common solutions. I'm not suggesting it's viable in this case though.


IPv6 end-users represent >30% of the traffic hitting some of the services I operate.


Perhaps the author has a point, but I don't think it's as bad as he puts it. Grandma and Grandad don't need to learn UNIX paths to use the net, they just need to know one thing: the hostname. They type in "google.com" or "walmart.com" and that's it, not hard to grok. Perhaps they don't fully get what .com means but they know it's needed and some sites have different ones.

You never see an advert with "visit our site at http://user:pass@mycompany.com/app/page1244.html?abc=def#hel..., it's just "mycompany.com". The structure of the URLs once the user has hit the landing page and clicked a couple of links doesn't need to be known by the user, nor should the user care.

To put it simply: if it was hard then the web wouldn't be as big as it is now. Non-technical people understand simple URIs and that's all that's needed.


I don't actually take sides on the question of if the URL should be formatted as it is. It's almost too ingrained in my mind at this point to seriously consider other forms.

I would say my biggest criticism of it is it's rather intolerant of human error. A single mistyped character, and you are not going to the website you intend. I think that's the reason searching for something on Google can actually be a better experience than trying to figure out the URL directly.

To your advertisement point, there was a long history of companies including the http:// and www in their ads. There's always this of course too: http://i.imgur.com/cfTxpd2.jpg


I believe the problem with getting to specific content still exists with the search engine approach (and, as you say, may be partially resolved by typing in the exact title of the content you're looking for).

I think search engines are also powerful in that they can provide context - google knows that when I search for C#, I am not looking for anything musical...but that inference is at odds with being able to pinpoint a specific content item out of a more or less infinite pile.

We were all forced to learn the technical constructs of mailing addresses and phone numbers, and URLs provide a similar service to a much larger problem set (so it's not surprising to me that they must be more complex).

I'm sure someone smarter than me will find a solution sooner or later...


That link seems to lead to a 1x1 jpeg.


Ah thanks, fixed!


The structure of the URL's once the user has hit the landing page and clicked a couple of links doesn't need to be known by the user, nor should the user care.

Sorry, but that's a completely wrong and horrible attitude and it's the sort of sentiment which leads to things like browsers hiding URLs and the resultant rise of even more computer-illiterate users completely dependent on (and thus at the mercy of) search engines and their opaque, proprietary, and sometimes completely idiotic ranking systems... you may find the vigorous discussion here relevant:

https://news.ycombinator.com/item?id=7677898


That's not entirely true:

http://cdn.doctorswithoutborders.org/sites/usa/files/tpp_ad_...
http://i.imgur.com/HWNM1KZ.jpg
http://www.shofforddesign.com/wp-content/uploads/Oakley-Maga...

This sort of thing is useful for advertising specifically. In the examples I've given:

1) targeted information
2) short url
3) direct-to-relevant-site from the ad

In my experience, in-house or business-wide advertising tends to use just the domain, but campaigns farmed out to marketing firms or campaigns related to specific products tend to use path info as well as domain name.

And while I've never seen auth credentials in an advertised URL, I have frequently seen "example.com/product#platform" style urls, where #platform (and sometimes ?platform) indicates which magazine the ad appeared in.


All of your examples have a domain name followed by a single, short landing page identifier. No knowledge of URLs / paths needed.

You seem to be reinforcing the OP's point.


None of those examples support your point, they are all short URLs. Also, the Doctors Without Borders one is a URL and a social media hashtag. Probably every time you see #whatever it is for a hashtag, it has nothing to do with a URL.


> I have frequently seen "example.com/product#platform" style urls, where #platform (and sometimes ?platform) indicates which magazine the ad appeared in.

Since the fragment doesn't get sent to the server, that seems broken …
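
A quick Python sketch of why that is, assuming a plain HTTP client: the request line carries only the path and query. `request_target` is a helper name invented for this example.

```python
from urllib.parse import urlsplit


def request_target(url):
    """Return the path-plus-query that a client puts on the HTTP request line."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    # parts.fragment is deliberately ignored: it stays in the browser.
    return target


with_fragment = request_target("http://example.com/product#platform")
with_query = request_target("http://example.com/product?platform")
print(with_fragment)
print(with_query)
```

The `#platform` variant reaches the server as plain `/product`, while the `?platform` variant keeps its marker, so only the query form is visible in server logs without client-side scripting.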


You can grab anchors with google analytics


Ouch, that's horrible — I'd no idea. Yet another reason to disable JavaScript by default.


In Japan ads don't say go to "somecompany.com" they say "search for 'somecompany'"

Yes, I realize that might be problematic. Just passing on the different internet culture.


"search for 'somecompany'" only really works if you have enough seo for all search services for that term. "somecompany.com" is something that gives just enough context to let you know it's a webpage and you type it in your browser. Just like #pokemongo lets you know it's a topic, likely on twitter and @pokemongo is the username of a company, likely on twitter.

But on the "pro" side for searching, it should help autocorrect people who can't remember the specific name or the correct spelling.


I wonder how much of this is related to historical inability to use native characters in domain names.


Great article indeed. What I personally missed there was a clear statement about segment (or path) parameters as opposed to query ("search") parameters:

    //domain.tld/segment;param1=value1,param2/segment?query=..
Nowadays it clearly seems a bit odd. I bumped into it while reading "HTTP: The Definitive Guide" (2002), chapter "2.2 URL Syntax". I read it just a few years ago, and ever since I have wondered whether this feature was ever really adopted by some wild CGI in the days before my time.

The Definitive Guide apparently paraphrased RFC 2396, which clearly defined the semicolon as the segment parameter delimiter. That RFC was later obsoleted [a] by RFC 3986, which demoted the former standard to a "possible practise" [b], stating that:

> URI producing applications often use the reserved characters allowed in a segment to delimit scheme-specific or dereference-handler-specific subcomponents. For example, the semicolon (";") and equals ("=") reserved characters are often used to delimit parameters and parameter values applicable to that segment. The comma (",") reserved character is often used for similar purposes. For example, one URI producer might use a segment such as "name;v=1.1" to indicate a reference to version 1.1 of "name", whereas another might use a segment such as "name,1.1" to indicate the same. Parameter types may be defined by scheme-specific semantics, but in most cases the syntax of a parameter is specific to the implementation of the URI's dereferencing algorithm. [c]

[a] wouldn't have noticed without: http://stackoverflow.com/questions/6444492/can-any-path-segm...
[b] https://tools.ietf.org/html/rfc3986#section-3.3


Yes, path segment parameters are a rather interesting (but mostly historical) curiosity.

Recently I have begun using them in a few of my internal APIs, where it is useful to have a distinction between "parameters used to filter a set of items" and "parameters used to request items in a particular format"... an example:

    GET /v1/tickets;status=closed;milestone=100?fields=id,title,reporter&sort=-last_updated
Pro: this nicely addresses the problem of collisions between the names of data fields (such as "title") and the names of response-modifying parameters (such as "sort").

Con: consumers of your API get to discover the joys of RFC 3986, because there is essentially zero support for path segment parameters in modern HTTP libraries.
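
For what it's worth, hand-rolling the parsing is not much code. A rough Python sketch, assuming `;` separates parameters within a segment and `=` separates name from value (one common convention that RFC 3986 mentions but does not mandate); `parse_segment_params` is a made-up name:

```python
from urllib.parse import parse_qs, unquote, urlsplit


def parse_segment_params(url):
    """Split each path segment into (name, params) and parse the query string."""
    parts = urlsplit(url)
    segments = []
    for raw in parts.path.split("/"):
        if not raw:
            continue  # skip the empty piece before the leading slash
        name, *params = raw.split(";")
        parsed = {}
        for param in params:
            if param:
                key, _, value = param.partition("=")
                parsed[unquote(key)] = unquote(value)
        segments.append((unquote(name), parsed))
    return segments, parse_qs(parts.query)


segments, query = parse_segment_params(
    "/v1/tickets;status=closed;milestone=100?fields=id,title,reporter&sort=-last_updated"
)
print(segments)
print(query)
```

The filter parameters end up attached to the `tickets` segment and the response-shaping parameters stay in the query, so the two namespaces cannot collide.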


tbh at a glance this seems a tad bit intimidating to handle, but I certainly see a lot of possible benefit from this as well. I should defo take a look at "HTTP: The Definitive Guide". Also, what happened to [c] D:



Just a random URL fact: the TLDs for domains could technically be used as domains as well, if ICANN ever allowed it.

It wouldn't make sense for TLDs like com, net, org, etc., but for trademarked TLDs like barclays, youtube, pwc, etc. visitors could essentially go straight to that webpage with the TLD, like https://youtube


the TLDs for domains could technically be used as domains as well

That's because they... actually are domains too. Top Level Domains, as the acronym says.

It wouldn't make sense for TLDs like com, net, org, etc.

Actually it might --- that's a great place to put all the information about (and maybe instructions on how to register) a subdomain there.

But, to use the precise terminology, a domain name with a single 'label' has historically been treated as a "relative" domain, e.g. https://localhost , and no doubt many of you will be familiar with intranet sites that use similar "non-qualified" single-label names.


That 'relative domain' problem is not specific to one-word domain names. In fact, ALL domains that don't end in '.' are considered 'relative' domains. There's a difference between http://somehost/ and http://somehost./ - the first says 'look for somehost in any of the available default domains', while the second says 'look for a top level domain name called somehost'. Likewise http://news.ycombinator.com/ and http://news.ycombinator.com./ are handled differently - the first is also treated as a relative name, and may be tried against your local DNS suffixes before being tried against the root.
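
A simplified Python sketch of that expansion. `candidate_names` is a made-up helper, and the suffix-first ordering is an assumption; real resolvers decide the order from their configuration (the `ndots` option in resolv.conf, for example).

```python
def candidate_names(name, search_domains):
    """List the fully qualified names a resolver might try for `name`."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no suffixes are applied.
        return [name]
    candidates = [f"{name}.{domain}." for domain in search_domains]
    candidates.append(name + ".")  # finally, try the name against the root
    return candidates


relative = candidate_names("somehost", ["somecompany.com"])
absolute = candidate_names("somehost.", ["somecompany.com"])
print(relative)
print(absolute)
```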


"Single name domains" as you refer to them usually only work because your DNS search domains are filling the rest in for you. This is why people who work for somecompany.com can simply go to http://intranet , because their DHCP server has somecompany.com in the DNS search domains and the browser/OS will be resolving it to http://intranet.somecompany.com under the hood.


Most of the time, www.TLD goes somewhere, but instructing non-technical users to visit "www.example" would likely confuse them much more than owning "example.com" with a CNAME set up for "www.example.com" as well. gTLDs are, IMO, a stupid cash grab which may or may not harm domain squatters.


Well, if you look at the history of URL use in the public offline realm, it keeps getting shorter.

URLs started as "http://www.example.com" in the late 90's, and then went to "www.example.com", and are now mostly "example.com". For major trademarks, it could be the natural progression to just go to "example". Or not - internet use is weirdly unpredictable.


My guess is that it'll stick to "example.com" unless we switch to a proprietary service (like Facebook or Twitter). A nice thing about "example.com" is that it gives you enough context to know it's a web address and you type it in your browser. Thinking back to AOL, they always had to say "AOL Keyword: Example" or "Keyword: Example" to give you enough context. Years ago maybe doing something like having it blue and underlined might have caught on and let you just use the word?

One way it could be shorter and keep context is if we hijacked a special character like Twitter does #example or @twitter--but I can't imagine standards bodies getting on board with using a special character (or trying to get people to adopt it as a replacement for web URLs).


One way I just thought of that wouldn't sacrifice a special character would be to come up with some symbol that means URL and have web browsers use that next to their URL bar. Could be an emoji-like symbol--maybe something similar to the html5 shield? Good luck getting everyone on board with that, though.


Generic TLDs have mostly solved that.


There was a push for custom TLDs once upon a time.


Solved what?


If you are thinking about .xyz and similar, you still have to have another domain below those. you can't simply register xyz and have it point directly at a server.


Sure you can: http://ai./ (there are other CCTLDs doing this).

Only one new TLD does this, and the page is empty: http://мон./ or http://xn--l1acc./
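
The two spellings are the same label: the second is the ASCII (Punycode) form of the first. A short Python check using the standard library's `idna` codec, which implements the older IDNA 2003 rules; that is enough for this label.

```python
label = "мон"

# Unicode label -> ASCII-compatible encoding used on the wire in DNS.
ascii_label = label.encode("idna").decode("ascii")

# And back again, to confirm the mapping is reversible.
round_trip = ascii_label.encode("ascii").decode("idna")

print(ascii_label)
print(round_trip)
```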


And now that Google owns the TLD `google' with their whole com.google thing a while back, what's to stop them from setting up https://google./? Is there any technical reason? Do you have to go through ICANN?


I think you're mistaken. There isn't a TLD for google. Under some of the ICANN proposals, that would have been possible, but to date I don't think it's been done.



They officially host their domain registration website on domains.google, too:

https://domains.google/


IMHO the arguments given in the quotes in the article are not very compelling, in view of the fact that humans have been identifying physical locations with far more complex and inconsistent systems for literally centuries:

https://news.ycombinator.com/item?id=8907301

Compared to "real-life addresses", URLs are an absolute pleasure to handle and understand; which naturally raises the question of why so many people seem to have trouble, or are suggesting that others do, with URLs? It's just a "virtual" address, in an easily-parseable format for identifying a location in "cyberspace". Perhaps it's the "every time you put an intelligent person in front of a computer, his/her brain suddenly disappears" syndrome (for lack of a better word)?

The advocacy of search engines instead of URLs is also not such a great idea; sadly, search engines like Google today do not work like a 'grep' that lets you find exactly what you're looking for, do not index every page (or allow you to see every result, which is somewhat equivalent), and the results are also strongly dependent upon some proprietary ranking algorithm which others have little to no control over. If relying on links and having them disappear is bad, relying on SERPs is even worse since they are far more dynamic and dependent on many more external factors which may even include things like which country your IP says you're from when you search.

Search engines are definitely useful for finding things, but as someone who has a collection of links gathered over many years, most of which are still alive and yet Google does not acknowledge the existence of even when the URL is searched, I am extremely averse to search engines as a sort of replacement for URLs.


The simple answer is humans make errors. Even mailing a letter the post office often applies corrections to written addresses which contain mistakes. This is only possible because the address system is relatively sparse and over-constrained.

For example:

Bob Smith
162 Portsmouth St
Denver, CO 42348

Encodes the city in both the zip and city / state, allowing either to be wrong. If 162 doesn't exist on that street, it's possible they can correct it based on the name. If it's Portsmouth Ave not St, they likely can figure that out.

None of that flexibility exists in URLs. A single mistyped character or misspelling and you are going to the wrong place with no way of getting where you want (without search engines).

It's a system which works very well for machines, and only passably well for error-prone humans.


This is a somewhat compelling reason why people don't use URLs like they do addresses (that plus the fact that physical locations are "discoverable", so if you know something is in the general vicinity of 40th street you can probably wander around a bit to find it).

But I don't know if it explains why people don't understand URLs. Physical location addresses are no more complicated than URLs (with the exception of URL-encoded blobs in URLs, which are not usually human-readable). That said, I don't usually talk about this stuff with non-tech types, so maybe the average person does understand URLs more or less.


Tim Berners-Lee thought we should have a way of defining strongly-typed queries: <ISINDEX TYPE="iana:/www/classes/query/personalinfo"> I can be somewhat confident in saying, in retrospect, I am glad the more generic solution won out.

The author does not explain why he is glad that the more generic solution won out. Having strongly-typed queries might have brought us much closer to some approximation of a practical "semantic web" and done some wonders for web services, accessibility and others.

Maybe he is glad because not having any strong typing allowed us to have the flexible, completely free-form web interfaces we have today, but who's to decide that that wouldn't have emerged anyway; maybe even in a slightly saner form than the horrible mess we have today.

On a completely unrelated note:

Given the power of search engines, it’s possible the best URN format today would be a simple way for files to point to their former URLs.

This raises immediate concerns about security and spam, but that may be solvable somehow.

That being said, I really enjoyed reading this thoroughly researched history a lot, even more than the previous, also great installment about CSS's history. (But my preference is just because I'm not a Web/Design guy.)


Sorry for not clarifying that. It's possible I'm only glad because it has turned out that forms are much more important than search alone. I also think it's roughly a good thing that we were ultimately given relatively basic tools, and allowed to build all sorts of things with them. I do agree that many unfortunate decisions were made in those early days, many of them because they were just not aware of what problems we would be using the Internet to solve in 2016.


If anyone is wondering (like I was) what the ISBN referenced is, it's The C Programming Language by Brian Kernighan and Dennis Ritchie.


As an aside, it would be interesting to adopt a convention where certain links included a hash of what it pointed to, avoiding the case that the (sub-)resource changed out from underneath the link. This implies that you'd want to link to a specific version of a (sub)resource. E.g. you could do something like what github does with:

https://github.com/USER/PROJECT/commit/COMMIT_HASH#DIFF_HASH

where the DIFF_HASH is a fragment pointing to a particular resource within the commit.
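
A rough Python sketch of the general idea. The `#sha256=<hex>` fragment convention is invented for this example and is not an existing standard; `link_with_hash` and `content_matches` are likewise made-up names.

```python
import hashlib
from urllib.parse import urlsplit


def link_with_hash(url, content):
    """Append a hash of the linked content as a fragment (hypothetical format)."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{url}#sha256={digest}"


def content_matches(link, content):
    """Check fetched content against the hash carried in the link's fragment."""
    algo, _, expected = urlsplit(link).fragment.partition("=")
    return algo == "sha256" and hashlib.sha256(content).hexdigest() == expected


original = b"<h1>The History of the URL</h1>"
link = link_with_hash("http://example.com/history", original)

unchanged = content_matches(link, original)
changed = content_matches(link, b"<h1>Something else</h1>")
print(link)
print(unchanged, changed)
```

Because the hash lives in the fragment, the server never sees it; the check has to happen on the client after the fetch.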


There is a spec called Subresource Integrity which does something similar for resources included inside HTML files.
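
For reference, an SRI integrity value is a hash algorithm name, a dash, and the base64-encoded digest of the file. A small Python sketch of producing one; `sri_value` is a made-up helper name and the script content is a placeholder.

```python
import base64
import hashlib


def sri_value(content):
    """Build a sha384 integrity value in the 'sha384-<base64 digest>' form."""
    digest = hashlib.sha384(content).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


integrity = sri_value(b"console.log('hello');\n")
print(integrity)
```

The result goes in the `integrity` attribute of a `<script>` or `<link>` tag, and the browser refuses to use the resource if the hash does not match.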

Doing it for links is problematic though. What do you do if the hash doesn't match? Show an error? Show the destination page anyway? What happens if the author _wants_ to change the content? Should links be immutable?


Cool, that's almost exactly the idea[1]. Although I would argue that resources should be immutable and versioned. Actually jsbin has a good representation there where you have:

    http://domain/resource/version
and `http://domain/resource` just points to either the latest version or to a specific one that the author chooses.

The big difference between SRI and my proposal is that the hash goes in the URL rather than in an attribute.

[1] https://www.w3.org/TR/SRI/


> As early as 1996 browsers were already inserting the http:// and www. for users automatically.

but not the default install IE9. Figures.


> but not the default install IE9. Figures

To be fair, the author said browsers, and not incompatible mock-browsers like IE.

Browsers:IE::cheese:"cheese food"


> rendering any advertisement which still contains them truly ridiculous

I don't find it ridiculous. Gopher has gone out of existence and ftp is a minority, but the http vs https distinction remains quite important to this day, especially considering that a redirect from http to https can be insecure and the proper way to open many sites is explicitly with https://

It might get fixed in the future, but it didn't happen yet.


We are part of the way to a solution to that: https://en.wikipedia.org/wiki/HTTP_Strict_Transport_Security

That said, I do think it's unreasonable to think an average Internet user will discern that you specified https. I think a better practice is to try to use https on every site, and only fallback if you decide you are willing to accept the lack of security.
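
For the curious, the HSTS policy is just a response header that the browser remembers. A minimal Python sketch of reading one, assuming only the `max-age` and `includeSubDomains` directives matter; `parse_hsts` is a made-up helper, not part of any library.

```python
def parse_hsts(header_value):
    """Parse a Strict-Transport-Security header value into a small dict."""
    policy = {"max_age": None, "include_subdomains": False}
    for directive in header_value.split(";"):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age":
            # The value may be quoted; it is a lifetime in seconds.
            policy["max_age"] = int(value.strip().strip('"'))
        elif name.lower() == "includesubdomains":
            policy["include_subdomains"] = True
    return policy


policy = parse_hsts("max-age=31536000; includeSubDomains")
print(policy)
```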


In the Java world URL/URI/URN construction, escaping, unescaping, manipulation is a confusing disparate buggy mess.

Of course there is the obvious broken `java.net.URL` but there are so many other libraries and coding practices where programmers just continuously screw up URL/URI/URNs over and over and over. It is like XML/HTML escaping but seems to be far more rampant in my experience (thankfully most templating languages now escape automatically).

In large part I believe this is because of the confusion of form encoding and because of the URI specs following later that supersede the URL specs (but actually are not entirely compatible).
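
The form-encoding confusion is easy to demonstrate. The sketch below uses Python's standard library rather than Java, purely for illustration: the same string encodes differently for a URI component and for a form body, and decoding with the wrong function silently corrupts it.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = "a b+c"

uri_encoded = quote(text, safe="")  # percent-encoding for a URI component
form_encoded = quote_plus(text)     # application/x-www-form-urlencoded style

# Decoding form data with the URI decoder leaves '+' as a literal plus sign.
wrong = unquote(form_encoded)
right = unquote_plus(form_encoded)

print(uri_encoded, form_encoded)
print(wrong, "|", right)
```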

In our code base alone we use something like 8 different URL/URI libraries: HTTP Components/Client (both 3 and 4), Spring has its own stuff, JAX-RS aka Jersey has its own stuff, the builtin crappy misnamed URLEncoder, the not-that-flexible java.net.URI, and several others that I can't recall. I'm surprised Guava hasn't joined in the game as well.

I would love a small decoupled URL/URI/URN library that does shit easily and correctly. URL templates would be nice as well. I have contemplated writing it several times.


This is a well-written article, but I find this criticism of URLs a bit exaggerated. We needed to reach an agreement on URL structure and there is little point in changing that now. Hardly anyone would abandon English for the unambiguous Lojban (https://en.wikipedia.org/wiki/Lojban) for example.

As for non-expiring identifiers to pages/content, the 'expired' URLs are as good identifiers as anything, considering URL redirects exist.

I did find this tidbit of information quite intriguing: the creator of the hierarchical file system attributes his creation to a two hour conversation he had with Albert Einstein in 1952!


I'm sorry you read it as criticism, it actually wasn't my intent to criticize our current URL system. I think it's a valuable system which has enabled the Internet as we know it.


I always suspected there must have been 2 archaic null tokens between the :// of http://...

If it was a design choice to use a 3-character separator instead of a single character, seems an odd choice.


https://www.ietf.org/rfc/rfc1738.txt may shed some light on this. It describes the ":" as the separator, while the // is a scheme-specific prefix, like a super-powered path separator.

More here: https://www.w3.org/People/Berners-Lee/FAQ.html#etc


Well, it does allow references to //example.com to cover both http and https schemes.
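
A small Python illustration of how such a scheme-relative reference resolves, using the standard `urljoin`; the host names are placeholders.

```python
from urllib.parse import urljoin

reference = "//example.com/lib.js"

# The reference inherits whatever scheme the containing page was loaded with.
from_https = urljoin("https://news.example/page", reference)
from_http = urljoin("http://news.example/page", reference)

print(from_https)
print(from_http)
```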


It's discussed a bit in the first portion of this series: https://eager.io/blog/the-history-of-the-url-domain-and-prot...


> This system made it possible to refer to different systems from within Hypertext, but now that virtually all content is hosted over HTTP, may not be as necessary anymore.

WS and WSS will become more and more commonplace over time. I like that a 25+ year old protocol is forward compatible enough to accommodate new methods of network communication. It could be debatable whether HTTPS and WSS are necessary for a URL but they give a hard guarantee that a secure connection will be made and not silently downgraded for those who care about such things.


I think we're getting closer to the point where browsers will explicitly flag HTTP as insecure.


I like the idea of URNs, mostly because I just think they're cool. They'd be pretty infeasible though.


If you think URNs are cool give IPFS a look. It's like content-addressing taken to the next level.

https://ipfs.io/


I am not a web developer, but I have to ask.

Using basic authentication over SSL, does that mean if you entered https://user:pass@domain that the user and pass would be sent in the clear, or does this get put into the header and encrypted?


Yes, basic authentication is encrypted over SSL but there are more problems to that: https://security.stackexchange.com/questions/988/is-basic-au...


It's base64 encoded and put into the header, according to the article.
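
A short Python sketch of what a client does with such a URL; `basic_auth_header` is a made-up helper name. Note that base64 is an encoding, not encryption, so any confidentiality comes from the TLS connection around it.

```python
import base64
from urllib.parse import urlsplit


def basic_auth_header(url):
    """Build the Authorization header value from the userinfo part of a URL."""
    parts = urlsplit(url)
    credentials = f"{parts.username}:{parts.password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


header = basic_auth_header("https://user:pass@domain")
print(header)

# Anyone who can read the header can trivially reverse the encoding.
decoded = base64.b64decode(header.split(" ", 1)[1]).decode("utf-8")
print(decoded)
```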


But if you use HTTPS, those headers are encrypted, right?


The problem is they can end up in the logs of the receiving server and if that gets hacked...


> The same thinking should lead us to a superior way of locating specific sites on the Web.

Try http://google.com



