
The History of the URL - migueldemoura
https://blog.cloudflare.com/the-history-of-the-url/
======
austincheney
That is an excellent article and I learned a tremendous amount.

I do have one minor technical criticism, though. It is so common for people to
conflate _parameter_ with the components of a query string that we don't give
it a second thought. The specification, however, does delineate these terms.
See:
[https://tools.ietf.org/html/rfc3986#section-3.4](https://tools.ietf.org/html/rfc3986#section-3.4)
and the preceding paragraph.

Specifically, parameters are trailing data components of the path section of
the URI (URL). The query string is separated from the path section by the
question mark. URI parameters are rarely used, though, so this is a common
mistake.

Also, encoding ampersands in a URI (URL) using HTML encoding schemes is
common, but that is incorrect. URI encoding uses percent encoding as its only
encoding scheme, such as %20 for a space. Using something like _&amp;_ will
literally put those five characters in the address unencoded, or may result in
something like _%26amp;_ in software that auto-converts characters into the
presumed encoding. (See the short sketch after the links below.)

* [https://tools.ietf.org/html/rfc3986#section-2.1](https://tools.ietf.org/html/rfc3986#section-2.1)

* [https://stackoverflow.com/questions/16622504/escaping-ampers...](https://stackoverflow.com/questions/16622504/escaping-ampersand-in-url)
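
To make this concrete, here is a minimal TypeScript sketch using the standard
encodeURIComponent function (the input strings are just illustrative):

    // Percent encoding is the only URI encoding scheme; HTML entities
    // like &amp; mean nothing to a URI parser.
    console.log(encodeURIComponent("c&t"));     // "c%26t"
    console.log(encodeURIComponent("a space")); // "a%20space"
    // "&amp;" is just five literal characters as far as a URI is concerned:
    console.log(encodeURIComponent("&amp;"));   // "%26amp%3B"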

~~~
maggit
> Also, encoding ampersands in a URI (URL) using HTML encoding schemes is
common, but that is incorrect.

To encode any string (for example a URL) containing & in HTML, you must HTML-
encode that &. Using &amp; in the value of the href attribute for an a-tag
must result in a URL containing just & in place of the entire entity. This is
a property of HTML that has nothing to do with URLs or URL encodings.

~~~
austincheney
So let's say you have a raw address with an ampersand that needs to be encoded
(the second one), so as not to confuse a URI parser into thinking there are
three query string segments when there are only two; the second ampersand is
part of a value, not a separator:

    http://domain.com/?name=data&tag=c&t

You will need to encode that ampersand so that it is interpreted as something
other than syntax:

    http://domain.com/?name=data&tag=c%26t

Now the first ampersand is not encoded but the second one is. You are correct
that ampersands are also syntax characters in HTML/XML, so if you wanted to
place that address in HTML code it would need to be escaped in HTML:

    http://domain.com/?name=data&amp;tag=c%26t

That address can now be inserted as the value of an HTML anchor tag like so:

    <a href="http://domain.com/?name=data&amp;tag=c%26t">somewhere</a>

The important point to distinguish is that addresses are often used in
contexts outside of HTML, even in the browser. For example, the address bar at
the top of the browser is outside the context of the viewport that displays
HTML content, and so the appropriate text there is:

    http://domain.com/?name=data&tag=c%26t

This is because URIs have only one encoding scheme, which is percent
encoding.
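
For completeness, here is a small TypeScript sketch building that query string
with the standard URLSearchParams API, which applies the percent encoding
automatically (values taken from the example above):

    // & between pairs stays as syntax; & inside a value becomes %26.
    const params = new URLSearchParams({ name: "data", tag: "c&t" });
    console.log(params.toString()); // "name=data&tag=c%26t"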

------
russellbeattie
I love articles like this, in part because they remind me of the early 90s
when I first started reading about this stuff and it all seemed arcane and
magical. [1]

I remember dismissing the "World Wide Web" because the 80286-based IBM PC I
used at college in 1993 couldn't run a graphical web browser (that I knew of),
so I compared a terminal web browser to Gopher and determined the latter was
far superior - it had more content and was much cleaner to use in a terminal.

The history of the Internet and Web definitely would have been soooo different
had URLs been formatted like "http:com/example/foo/bar/baz". It's so much
cleaner and more sensical. Part of the mystique of "foo.com" is that it
somehow seems completely different from "bar.org". Not sure why, but it just
is.

Just a side note: DOS and Windows using \ instead of / is annoying and has
been annoying for nearly 40 years, and I don't think I'll ever find it not
annoying. You'd think 4 decades would be enough time, but it still bugs me.

1.
[https://en.m.wikipedia.org/wiki/Whole_Internet_User's_Guide_...](https://en.m.wikipedia.org/wiki/Whole_Internet_User's_Guide_and_Catalog)

~~~
thedance
You could have viewed images on a 286 (or even an 8086/8088) using NCSA Telnet
for DOS, just for the record.

------
shellac
> In 1992 Tim Berners-Lee created three things, giving birth to what we
> consider the Internet. The HTTP protocol, HTML, and the URL.

What we call the web, surely? I appreciate we conflate the two, but in this
context I think that's what was meant.

And it's really the URL (URI, IRI, URN...) that makes the web. An amazing
thing.

~~~
tannhaeuser
> _And it's really the URL (URI, IRI, URN...) that makes the web._

Well, Chrome devs seem to disagree, with the recent push to hide URLs, and so
do SPA devs. Is the article meant to save the URL, or is it rather a farewell?
Anyway, URLs aren't particularly well designed IMO, if only for their use of
the ampersand as a separator in the query part, since it conflicts with the
ampersand that starts entity references in HTML and other SGML docs.

~~~
thenewnewguy
> and so do SPA devs

??? - those usually use the URL bar in my experience.

~~~
vanadium
In most SPAs, however, URLs aren't used like conventional URLs; they're
treated as routes. I think that's more the point here.

They aren't conventionally "shareable", first off, which more or less violates
the spirit of what a URL is supposed to represent on the Web.

~~~
scrollaway
This is pretty much untrue. Gmail is an SPA, for example, and I can share
"[https://mail.google.com/mail/u/0/#spam](https://mail.google.com/mail/u/0/#spam)"
with you, or
"[https://mail.google.com/mail/u/0/#inbox/FMfcgxwGDWqXKSSfXBDQ...](https://mail.google.com/mail/u/0/#inbox/FMfcgxwGDWqXKSSfXBDQMrkGlcFVBlzk)"
(of course you can't access the latter because you don't have the appropriate
permissions for it … but the point still stands: I can open that URL in a new
tab and it'll open the same email).

I've written plenty of SPAs myself that use the URL and you couldn't actually
tell it's an SPA unless you have the technical experience to know what to look
for.

If what you're saying is "this infinite-scrolling page sucks at respecting
URLs", that's something else, and it's not specific to SPAs. Infinite
scrolling predates SPAs (here is one for jQuery!
[https://infinite-scroll.com/#initialize-with-jquery](https://infinite-scroll.com/#initialize-with-jquery)),
and it doesn't _have_ to break URLs if you don't write your code like an
uncivilized brute.
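
As a minimal sketch of the pattern such URLs rely on (TypeScript; renderView
is a hypothetical stand-in for real rendering logic), hash-based routing can
be as simple as:

    // Re-render whenever the fragment after # changes, and once on load.
    function renderView(name: string): void {
      document.body.textContent = `Viewing: ${name}`; // stand-in UI
    }
    function route(): void {
      const view = window.location.hash.slice(1); // e.g. "spam" or "inbox"
      renderView(view || "inbox");                // default view
    }
    window.addEventListener("hashchange", route);
    route();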

~~~
cc81
Breaking the back button and not using proper URLs are two cardinal sins of
SPA development.

------
chrisweekly
Here's the original, classic "Cool URIs Don't Change" post by TBL himself:
[https://www.w3.org/Provider/Style/URI.html](https://www.w3.org/Provider/Style/URI.html)

~~~
beardicus
Interestingly, the canonical link to that page is actually:
[https://www.w3.org/Provider/Style/URI](https://www.w3.org/Provider/Style/URI)

"You may not be using HTML for that page in 20 years time, but you might want
today's links to it to still be valid. The canonical way of making links to
the W3C site doesn't use the extension."

I am constantly surprised by how difficult many static site generators make it
to create URLs of this format (notice no trailing slash, before you tell me
about the "Pretty URLs" options in most software).

~~~
chrisweekly
Ha! How ironic that (in part because I'm on my phone and hasty) I grabbed the
wrong URL.

------
theclaw
Anyone interested in the development of the ARPANET and its transformation
into the Internet we know today owes it to themselves to read _Where Wizards
Stay Up Late_ by Hafner and Lyon - it’s a great read and the audiobook isn’t
bad either.

------
thedance
If you're going to write a mile-long article about the URL, at least get the
details correct. The leftmost part of a URL is not the "protocol". It is
called the "scheme". The scheme doesn't tell you "the protocol which should be
used to access it", it tells you how to interpret the remainder of the URL.
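
The data: scheme illustrates this well: nothing is fetched over any protocol
at all, because the rest of the URL _is_ the resource. This URL, for example,
decodes to the text "Hello, World!":

    data:text/plain;base64,SGVsbG8sIFdvcmxkIQ==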

~~~
colejohnson66
Except HTTP stands for “HyperText Transfer _Protocol_”

~~~
thedance
There's nothing in "data" or "file" that stands for /Protocol/.

------
ck2
You suddenly realize you are very old when an article tries to impress with
ancient photos of a PDP-11 and cradle-type dial-up modems, and it's startling
that I've used all of that in my lifetime, extensively.

------
neillyons
> Root DNS servers operate in safes, inside locked cages. A clock sits on the
> safe to ensure the camera feed hasn’t been looped.

That is cool. Does anyone have any more info? Perhaps a picture?

~~~
zackbloom
I saw one at a data center in El Segundo once. It was 2011 and about 60% of
the DC was a single guy sitting on the floor quietly packing up MySpace
servers.

~~~
rcaught
Was it Tom Anderson?

------
btrettel
What are the available solutions to the problem of locating other copies of
webpages or documents online? (Let's assume that the page of interest is not
on the Internet Archive.) This article mentions a few:

> I was able to find these pages through Google, which has functionally made
> page titles the URN of today.

> Given the power of search engines, it’s possible the best URN format today
> would be a simple way for files to point to their former URLs.

Daniel Bernstein proposes a document ID that can be found in search engines:
[https://cr.yp.to/bib/documentid.html](https://cr.yp.to/bib/documentid.html)

I actually started using this at one point, but found it clumsy and stopped
using it.

Someone else has suggested a UUID instead:
[https://lobste.rs/s/xltmol/this_page_is_designed_last#c_nis6...](https://lobste.rs/s/xltmol/this_page_is_designed_last#c_nis6no)

But that's still clumsy. I'd prefer something shorter.

Perhaps the title of the page is the best option, as people are more likely to
have that saved than the UUID:
[https://lobste.rs/s/xltmol/this_page_is_designed_last#c_0snr...](https://lobste.rs/s/xltmol/this_page_is_designed_last#c_0snrhr)

> I imagine it’d be uncommon for someone to have the UUID but not the website
> saved.

~~~
AgentME
If a page is available on IPFS, then the page would be accessible under a URL
based on the page's hash, and anyone with a cached/saved copy of the page
would be able to help host it at that URL. (IPFS works very similarly to
torrents with magnet links.)

~~~
btrettel
Interesting. How would IPFS handle updates to a page? I assume that the
updated page would have to be hosted separately.

~~~
AgentME
Yeah, that generally applies to content-addressable systems like it (and
torrents). An updated version of the page would have a new hash and a separate
IPFS link.

To make a URL for content that can change, you would need to use a system that
lets you create a mutable URL, and then you can make that point to an IPFS
link. DNS works well for that. You can add a "_dnslink" TXT DNS record to a
domain, and then when anyone with an IPFS-supporting browser (or browser add-
on) accesses the domain, their browser can fetch the content over IPFS from
anyone seeding it, and help seed it if the user wants. (Yes, this wouldn't
work well at all for domains with dynamic content. Works great for domains
that have static content, including sites made by static site generators,
etc.)
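
As an illustration, the record is just a TXT entry pointing at an IPFS path,
something like the following, where <CID> is a placeholder for the content
hash of the site root:

    _dnslink.example.com.  IN  TXT  "dnslink=/ipfs/<CID>"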

I personally serve my blog with IPFS by making its files accessible over IPFS,
putting a _dnslink TXT record on my domain pointing to the directory's current
IPFS link, and then pointing my domain's A record at a service
(cloudflare-ipfs.com) that responds to HTTP requests by serving contents from
the IPFS link my _dnslink record points to. I'm using multiple free IPFS pinning
services to keep my blog's files seeded on IPFS. I like that I'm not tied down
to any of them, and I could easily replace them with other services or my own
server without changing any of the rest of the setup. Additionally, anyone
that liked my blog could help seed it themselves, so it could outlast the
pinning services and me.

Assuming a world where IPFS was commonly supported and my content was
well-liked enough to get seeded by others, the only point of failure for
keeping my site up is the domain name staying renewed and the DNS service I'm
using staying up. Though even if those went down, my site would still be
accessible through the last IPFS link as long as someone still had it and
people seeded the content.

I believe the Ethereum Name Service would also be a good decentralized
alternative to using DNS for keeping update-able URLs pointing to IPFS
content, but I haven't personally used it and I don't know if there are good
integrations that make it usable for that today. IPFS also has a feature
called IPNS for creating mutable links to content that can be updated by
whoever owns the private key, which sounds perfect on paper, but it doesn't
work well in my experience for a few reasons (latency, timeouts, etc), so I
wouldn't recommend it.

~~~
btrettel
Thanks for all this information! I think I might do the _dnslink TXT record
approach you mention for my own website (just a static site), though I'll
first need to learn more about DNS and IPFS.

------
AKluge
I remember when the hosts.txt file crossed half a page. I thought: this
network is taking off. :)

------
divbzero
This is a masterful history and a reminder that what we have now is just a
quasi-frozen snapshot of evolving solutions. Is there any chance that an
alternative (URN? IPFS? Dat?) will gain traction and resolve some of the
shortcomings of the URL?

~~~
Agenttin
When google first moved search to the address bar they started the transition
to using page titles as addresses. This is a somewhat sloppy way to go about
things, but when I visit this site, I just type "hacker news" not the FQDN.

------
ChrisArchitect
Bah, I believe this was adapted from an article from 2016; commentary here:
[https://news.ycombinator.com/item?id=12117540](https://news.ycombinator.com/item?id=12117540)

------
CaptArmchair
I started reading the article with much interest... up until the bit about the
Semantic Web. Then I felt things went downhill.

> One such effort was the Semantic Web. The dream was to create a Resource
> Description Framework (editorial note: run away from any team which seeks to
> create a framework), which would allow metadata about content to be
> universally expressed. For example, rather than creating a nice web page
> about my Corvette Stingray, I could make an RDF document describing its
> size, color, and the number of speeding tickets I had gotten while driving
> it.

> This is, of course, in no way a bad idea. But the format was XML based, and
> there was a big chicken-and-egg problem between having the entire world
> documented, and having the browsers do anything useful with that
> documentation.

The author completely fails to describe the evolution of the SemWeb over the
past 10 years. Tons of specs, several declarative languages, and technologies
have grown up not just to get beyond the verbosity of a serialization format
such as XML, but also to move away from the classic relational data model.

Turtle, JSON-LD, SPARQL, Neo4J, Linked Data Fragments,... come to mind. And
then there are the emerging applications of linked data. If anything, the
Federated Web is exactly about URLs and semantic web technologies based on
linking and contextualizing data.
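
For a taste of how far the syntax has come from RDF/XML, the article's own
Corvette example fits in a few lines of Turtle (namespace and values are
illustrative):

    @prefix schema: <http://schema.org/> .
    @prefix ex:     <http://example.org/> .
    ex:my-corvette a schema:Car ;
        schema:name  "Corvette Stingray" ;
        schema:color "red" .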

The entire premise of Tim Berners-Lee's Solid/Inrupt is based on these
standards, including URIs.

Linked data and federation aren't just about challenging social media;
they're also about creating knowledge graphs - such as wikidata.org - and
creating opportunities for things such as open access and open science.

Then there's this:

> httpRange-14 sought to answer the fundamental question of what a URL is.
> Does a URL always refer to a document, or can it refer to anything? Can I
> have a URL which points to my car?

> They didn’t attempt to answer that question in any satisfying manner.
> Instead they focused on how and when we can use 303 redirects to point users
> from links which aren’t documents to ones which are, and when we can use URL
> fragments (the bit after the ‘#’) to point users to linked data.

Err. They did.

That's what the Resource Description Framework is all about. It gives you a
few foundational building blocks for describing the world. Even more so, URIs
have absolutely NOTHING to do with HTTP status codes. It just so happens that
HTTP leverages URIs and creates a subset called HTTP URLs that allows the
identification and dereferencing of web-based resources.

You can use URIs as globally unique identifiers in a database. You could use
URNs to identify books. For instance, urn:isbn:0451450523 is an identifier for
the 1968 novel The Last Unicorn.

So, this is a false claim. I could forgive them for inadvertently not looking
beyond URLs as a mechanism used within the context of HTTP communication.

> In the world of web applications, it can be a little odd to think of the
> basis for the web being the hyperlink. It is a method of linking one
> document to another, which was gradually augmented with styling, code
> execution, sessions, authentication, and ultimately became the social shared
> computing experience so many 70s researchers were trying (and failing) to
> create. Ultimately, the conclusion is just as true for any project or
> startup today as it was then: all that matters is adoption. If you can get
> people to use it, however slipshod it might be, they will help you craft it
> into what they need. The corollary is, of course, no one is using it, it
> doesn’t matter how technically sound it might be. There are countless tools
> which millions of hours of work went into which precisely no one uses today.

I'm not even sure what the conclusion is here. Did the 'hyperlink' fail? Did
the concept of a 'URI' fail? (The two are different things!) Because neither
failed; on the contrary!

Then there's this wonky comparison of the origin of the Web to a single
project or a startup. The author did all this research on the history of the
URI but still failed to see that the Internet and the Web were invented by
committee and by coincidence. Pioneers all over the place had good ideas; some
coalesced and succeeded, others didn't. Some were adapted to work together in
a piecemeal fashion, such as Basic Auth.

And that's totally normal. Organic growth and distributed development are the
baseline. Yes, the Web as we know it today is the result of many competing
voices, but at the same time it could only work if everyone ended up agreeing
on the basics.

The fact of the matter is that some companies - looking at you, FAANG - would
rather have us all locked into closed, black-box ecosystems than have open
standards around that allow for interoperability, and thus create
opportunities for new threats to challenge their business interests.

I understand that the article is written by CloudFlare, a CDN company with its
own interests. But I'm trying to wrap my head around how the author failed to
address exactly these future opportunities and threats after this entire
exposé.

~~~
jgrahamc
_I understand that the article is written by CloudFlare, a CDN company with
its own interests._

Not sure what you mean by that, but Zack wasn't writing something to further
some secret interest Cloudflare has in the structure of URLs.

~~~
shellac
I think the operative phrase is 'a CDN company'.

URLs are names for things (companies, mailboxes, pictures of cats), but
they're also (encoded) directions to get representations of those named
things.

CloudFlare is mostly concerned with the mechanics, the latter. Things like
Wikidata, knowledge bases, and schema.org are interested in the former
perspective.

~~~
nl
Cloudflare has Edge Workers and lots of other technologies that would be
helpful to the semantic web.

Anything that is URL addressable is great for Cloudflare.

