The History of the URL (cloudflare.com)



That is an excellent article and I learned a tremendous amount.

I do have one minor technical criticism though. It is so common for people to conflate parameters with the components of a query string that we don't give it a second thought. The specification, though, does distinguish these terms. See: https://tools.ietf.org/html/rfc3986#section-3.4 and the preceding paragraph.

Specifically, parameters are trailing data components of the path section of the URI (URL); the query string is separated from the path section by the question mark. URI parameters are rarely used, though, which is why this mistake is so common.

Also, encoding ampersands into a URI (URL) using HTML encoding schemes is common, but that is incorrect. URI encoding uses percent-encoding as its only encoding scheme, such as %20 for a space. Using something like &amp; will literally put those 5 characters in the address unencoded, or may result in something like %26amp; in software that auto-converts characters into the presumed encoding.

* https://tools.ietf.org/html/rfc3986#section-2.1

* https://stackoverflow.com/questions/16622504/escaping-ampers...
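
As a quick illustration of percent-encoding, here is a minimal sketch using Python's standard library (the literal values are just examples):

    from urllib.parse import quote

    print(quote(" "))           # -> %20 (a space)
    print(quote("&", safe=""))  # -> %26 (an ampersand used as data, not as a separator)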


It's important to use the jargon precisely, as you did, otherwise you end up with gibberish that nobody understands, like Python's "get_selector" function in urllib. Nobody knows what the heck a selector is, and the word does not even appear in RFC 3986.


I believe the discussion of encoding ampersands is as relates to printing them out in the text of the page, where you would indeed want to use the HTML entity encoding.


>Also encoding ampersands into a URI (URL) using HTML encoding schemes is also common, but that is incorrect.

To encode any string (for example a URL) containing & in HTML, you must HTML-encode that &. Using &amp; in the value of the href attribute of an a-tag must result in a URL containing just & in place of the entire entity. This is a property of HTML that has nothing to do with URLs or URL encodings.


So let's say you have a raw address with an ampersand that needs to be encoded (the second one), so as not to confuse a URI parser into thinking there are 3 query string data segments when there are only 2, as the second ampersand is part of a value and not a separator:

    http://domain.com/?name=data&tag=c&t
You will need to encode that ampersand so that it is interpreted as something other than syntax:

    http://domain.com/?name=data&tag=c%26t
Now the first ampersand is not encoded but the second one is. You are correct that ampersands are also syntax characters in HTML/XML so if you wanted to place that address in HTML code it would need to be escaped in HTML:

    http://domain.com/?name=data&amp;tag=c%26t
That address can now be inserted as the value of an HTML anchor tag as such:

    <a href="http://domain.com/?name=data&amp;tag=c%26t">somewhere</a>
The important point to distinguish is that addresses are often used in contexts outside of HTML, even in the browser. For example the address bar at the top of the browser is outside the context of the view port that displays HTML content, and so the appropriate text there is:

    http://domain.com/?name=data&tag=c%26t
This is so because URIs have only one encoding scheme, which is percent-encoding.
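
For what it's worth, here is a rough sketch of the same walkthrough using Python's standard library (the domain and values are the ones from the example above):

    from html import escape
    from urllib.parse import urlencode

    # Percent-encode the query values; the "&" inside "c&t" becomes %26.
    url = "http://domain.com/?" + urlencode({"name": "data", "tag": "c&t"})
    print(url)  # http://domain.com/?name=data&tag=c%26t

    # Only when embedding the URL in HTML does the separator "&" get entity-encoded.
    print('<a href="%s">somewhere</a>' % escape(url))
    # <a href="http://domain.com/?name=data&amp;tag=c%26t">somewhere</a>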


I love articles like this, in part because they remind me of the early 90s when I first started reading about this stuff and it all seemed arcane and magical. [1]

I remember dismissing the "World Wide Web" because my 80286-based IBM PC I used at college in 1993 couldn't run a graphical web browser (that I knew of) so I compared the terminal versions of a web browser to Gopher and determined the latter was far superior - it had more content and was much cleaner to use in a terminal.

The history of the Internet and Web definitely would have been soooo different had URLs been formatted like "http:com/example/foo/bar/baz" for sure. It's so much cleaner and sensical. Part of the mystique of "foo.com" is that it somehow seems completely different from "bar.org". Not sure why, but it just is.

Just a side note: DOS and Windows using \ instead of / is annoying and has been annoying for nearly 40 years and I don't ever think I'll ever find it not annoying. You'd think 4 decades would be enough time, but it still bugs me.

1. https://en.m.wikipedia.org/wiki/Whole_Internet_User's_Guide_...


You could have viewed images on a 286 (or even an 8086/8088) using NCSA Telnet for DOS, just for the record.


> In 1992 Tim Berners-Lee created three things, giving birth to what we consider the Internet. The HTTP protocol, HTML, and the URL.

What we call the web, surely? I appreciate we conflate the two, but in this context I think that's what was meant.

And it's really the URL (URI, IRI, URN...) that makes the web. An amazing thing.


Agreed. I consider the URL the most important part of a web page's user interface.

I get incredibly frustrated with websites (e.g. many SPAs or frontend-happy sites) that don't provide a URL that I can bookmark and share where it obviously makes sense to do so.


> And it's really the URL (URI, IRI, URN...) that makes the web.

Well, Chrome devs seem to disagree, with the recent push to hide URLs, and so do SPA devs. Is the article meant to save the URL, or is it rather a farewell? Anyway, URLs aren't particularly well designed IMO, if only for their use of the ampersand character as a separator in the query part, since it conflicts with the ampersand that starts entity references in HTML and other SGML docs.


> and so do SPA devs

??? - those usually use the URL bar in my experience.


In most SPAs, however, URLs aren't used like conventional URLs; they're treated as routes. I think that's more the point here.

They aren't conventionally "shareable", first off, which violates more or less the spirit of what a URL is supposed to represent on the Web.


This is pretty much untrue. Gmail is an SPA, for example, and I can share "https://mail.google.com/mail/u/0/#spam" with you, or "https://mail.google.com/mail/u/0/#inbox/FMfcgxwGDWqXKSSfXBDQ..." (of course you can't access the latter because you don't have the appropriate permissions for it … but the point still stands: I can open that URL in a new tab and it'll open the same email).

I've written plenty of SPAs myself that use the URL and you couldn't actually tell it's an SPA unless you have the technical experience to know what to look for.

If what you're saying is "this infinite-scrolling page sucks at respecting URLs", that's something else, and it's not specific to SPAs. Infinite scrolling predates SPAs (Here is one for jQuery! https://infinite-scroll.com/#initialize-with-jquery), and doesn't have to break URLs if you don't write your code like an uncivilized brute.


Breaking the back button and not using proper URLs are two cardinal sins in SPA development.


Here's the original, classic "Cool URIs Don't Change" post by TBL himself: https://www.w3.org/Provider/Style/URI.html


Interestingly, the canonical link to that page is actually: https://www.w3.org/Provider/Style/URI

"You may not be using HTML for that page in 20 years time, but you might want today's links to it to still be valid. The canonical way of making links to the W3C site doesn't use the extension."

I am constantly surprised by how difficult many static site generators make it to create URLs of this format (notice no trailing slash, before you tell me about the "Pretty URLs" options in most software).


Ha! How ironic that (in part because I was on my phone and hasty) I grabbed the wrong URL.


Anyone interested in the development of the ARPANET and its transformation into the Internet we know today owes it to themselves to read Where Wizards Stay Up Late by Hafner and Lyon - it’s a great read and the audiobook isn’t bad either.


If you're going to write a mile-long article about the URL, at least get the details correct. The leftmost part of a URL is not the "protocol". It is called the "scheme". The scheme doesn't tell you "the protocol which should be used to access it", it tells you how to interpret the remainder of the URL.


Except HTTP stands for “HyperText Transfer Protocol”.


There's nothing in "data" or "file" that stands for /Protocol/.
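
To illustrate (a minimal sketch using Python's standard library; the example URLs are made up):

    from urllib.parse import urlparse

    # The leftmost component is a scheme, not necessarily a network protocol.
    for u in ("http://example.com/", "file:///tmp/readme.txt",
              "data:text/plain,hello", "urn:isbn:0451450523"):
        print(urlparse(u).scheme)  # http, file, data, urn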


You suddenly realize you are very old when an article tries to impress with ancient photos of a PDP-11 and cradle-type dial-up modems, and it's startling that I've used all of that in my lifetime, extensively.


> Root DNS servers operate in safes, inside locked cages. A clock sits on the safe to ensure the camera feed hasn’t been looped.

That is cool. Does anyone have any more info? Perhaps a picture?


I saw one at a data center in El Segundo once. It was 2011 and about 60% of the DC was a single guy sitting on the floor quietly packing up MySpace servers.


Was it Tom Anderson?


What are the available solutions to the problem of locating other copies of webpages or documents online? (Let's assume that the page of interest is not on the Internet Archive.) This article mentions a few:

> I was able to find these pages through Google, which has functionally made page titles the URN of today.

> Given the power of search engines, it’s possible the best URN format today would be a simple way for files to point to their former URLs.

Daniel Bernstein proposes a document ID that can be found in search engines: https://cr.yp.to/bib/documentid.html

I actually started using this before, but found it to be clumsy and stopped using it.

Someone else has suggested a UUID instead: https://lobste.rs/s/xltmol/this_page_is_designed_last#c_nis6...

But that's still clumsy. I'd prefer something shorter.

Perhaps the title of the page is the best option, as people are more likely to have that saved than the UUID: https://lobste.rs/s/xltmol/this_page_is_designed_last#c_0snr...

> I imagine it’d be uncommon for someone has the UUID but not the website saved.


If a page is available on IPFS, then the page would be accessible under a URL based on the page's hash, and anyone with a cached/saved copy of the page would be able to help host it at that URL. (IPFS works very similarly to torrents with magnet links.)


Interesting. How would IPFS handle updates to a page? I assume that the updated page would have to be hosted separately.


Yeah, that generally applies to content-addressable systems like it (and torrents). An updated version of the page would have a new hash and a separate IPFS link.
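
Content addressing in miniature (a sketch using a plain SHA-256 in Python; IPFS actually uses multihash-based CIDs rather than bare hex digests, but the principle is the same):

    import hashlib

    v1 = b"<html>my page, version 1</html>"
    v2 = b"<html>my page, version 2</html>"

    # The address is derived from the bytes themselves,
    # so any edit produces a completely different address.
    print(hashlib.sha256(v1).hexdigest())
    print(hashlib.sha256(v2).hexdigest())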

To make a URL for content that can change, you would need to use a system that lets you create a mutable URL, and then you can make that point to an IPFS link. DNS works well for that. You can add a "_dnslink" TXT DNS record to a domain, and then when anyone with an IPFS-supporting browser (or browser add-on) accesses the domain, their browser can fetch the content over IPFS from anyone seeding it, and help seed it if the user wants. (Yes, this wouldn't work well at all for domains with dynamic content. Works great for domains that have static content, including sites made by static site generators, etc.)
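
Roughly, the lookup side looks like this (a sketch assuming the third-party dnspython package; example.com is a placeholder, and the record value follows the dnslink=/ipfs/<CID> convention):

    import dns.resolver  # third-party "dnspython" package

    def dnslink(domain):
        for rdata in dns.resolver.resolve("_dnslink." + domain, "TXT"):
            txt = b"".join(rdata.strings).decode()
            if txt.startswith("dnslink="):
                return txt[len("dnslink="):]  # e.g. "/ipfs/Qm..."
        return None

    print(dnslink("example.com"))  # placeholder domain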

I personally serve my blog with IPFS by making its files accessible over IPFS, putting a _dnslink TXT record on my domain pointing to the directory's current IPFS link, and then my domain's A record is pointing to a service (cloudflare-ipfs.com) that responds to HTTP requests by serving contents from the IPFS link that my _dnslink record points to. I'm using multiple free IPFS pinning services to keep my blog's files seeded on IPFS. I like that I'm not tied down to any of them, and I could easily replace them with other services or my own server without changing any of the rest of the setup. Additionally, anyone that liked my blog could help seed it themselves, so it could outlast the pinning services and me.

Assuming a world where IPFS was commonly supported and my content was well-liked enough to get seeded by others, the only points of failure for keeping my site up are the domain name staying renewed and the DNS service I'm using staying up. Though if those went down, as long as someone still had the last IPFS link to my site, my site would still be accessible through that as long as people seeded it.

I believe the Ethereum Name Service would also be a good decentralized alternative to using DNS for keeping update-able URLs pointing to IPFS content, but I haven't personally used it and I don't know if there are good integrations that make it usable for that today. IPFS also has a feature called IPNS for creating mutable links to content that can be updated by whoever owns the private key, which sounds perfect on paper, but it doesn't work well in my experience for a few reasons (latency, timeouts, etc), so I wouldn't recommend it.


Thanks for all this information! I think I might do the _dnslink TXT record approach you mention for my own website (just a static site), though I'll first need to learn more about DNS and IPFS.


I remember when the hosts.txt file crossed half a page. I thought: this network is taking off. :)


This is a masterful history and a reminder that what we have now is just a quasi-frozen snapshot of evolving solutions. Is there any chance that an alternative (URN? IPFS? Dat?) will gain traction and resolve some of the shortcomings of the URL?


When Google first moved search to the address bar, they started the transition to using page titles as addresses. This is a somewhat sloppy way to go about things, but when I visit this site, I just type "hacker news", not the FQDN.


Bah, I believe this was adapted from an article from 2016; commentary here: https://news.ycombinator.com/item?id=12117540


I started reading the article with much interest... up until the bit about the Semantic Web. Then I felt things went downhill.

> One such effort was the Semantic Web. The dream was to create a Resource Description Framework (editorial note: run away from any team which seeks to create a framework), which would allow metadata about content to be universally expressed. For example, rather than creating a nice web page about my Corvette Stingray, I could make an RDF document describing its size, color, and the number of speeding tickets I had gotten while driving it.

> This is, of course, in no way a bad idea. But the format was XML based, and there was a big chicken-and-egg problem between having the entire world documented, and having the browsers do anything useful with that documentation.

The author completely falls short of describing the evolution of the SemWeb over the past 10 years. Tons of specs, declarative languages, and technologies have grown up not just to get beyond the verbosity of a serialization format such as XML, but also to move away from the classic relational data model.

Turtle, JSON-LD, SPARQL, Neo4J, Linked Data Fragments,... come to mind. And then there are the emerging applications of linked data. If anything, the Federated Web is exactly about URLs and semantic web technologies based on linking and contextualizing data.
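
For anyone who hasn't looked since the XML days, a minimal JSON-LD document (a sketch; the vocabulary is schema.org and the values are made up, echoing the article's Corvette example) is just JSON where context and identity come from URLs:

    import json

    doc = {
        "@context": "https://schema.org",             # where the terms below are defined
        "@id": "https://example.com/cars/stingray",   # the thing being described
        "@type": "Car",
        "name": "Corvette Stingray",
        "color": "red",
    }
    print(json.dumps(doc, indent=2))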

The entire premise of Tim Berners-Lee's Solid/Inrupt is based on these standards, including URIs.

Linked data and federation isn't just about challenging social media, it's also about creating knowledge graphs - such as wikidata.org - and creating opportunities for things such as open access and open science.

Then there's this:

> httpRange-14 sought to answer the fundamental question of what a URL is. Does a URL always refer to a document, or can it refer to anything? Can I have a URL which points to my car?

> They didn’t attempt to answer that question in any satisfying manner. Instead they focused on how and when we can use 303 redirects to point users from links which aren’t documents to ones which are, and when we can use URL fragments (the bit after the ‘#’) to point users to linked data.

Err. They did.

That's what the Resource Description Framework is all about. It gives you a few foundational building blocks for describing the world. Even more so, URIs have absolutely NOTHING to do with HTTP status codes. It just so happens that HTTP leverages URIs and creates a subset called HTTP URLs that allows the identification and dereferencing of web-based resources.

You can use URIs as globally unique identifiers in a database. You could use URNs to identify books. For instance, urn:isbn:0451450523 is an identifier for the 1968 novel The Last Unicorn.

So, this is a false claim. I could forgive them for inadvertently not looking beyond URLs as a mechanism used within the context of HTTP communication.

> In the world of web applications, it can be a little odd to think of the basis for the web being the hyperlink. It is a method of linking one document to another, which was gradually augmented with styling, code execution, sessions, authentication, and ultimately became the social shared computing experience so many 70s researchers were trying (and failing) to create. Ultimately, the conclusion is just as true for any project or startup today as it was then: all that matters is adoption. If you can get people to use it, however slipshod it might be, they will help you craft it into what they need. The corollary is, of course, no one is using it, it doesn’t matter how technically sound it might be. There are countless tools which millions of hours of work went into which precisely no one uses today.

I'm not even sure what the conclusion is here. Did the 'hyperlink' fail? Did the concept of a 'URI' fail? (The two are different things!) Because neither failed; on the contrary!

Then there's this wonky comparison of the origin of the Web with a single project or a startup. The author did all this research on the history of the URI, but still failed to see that the Internet and the Web were invented by committee and by coincidence. Pioneers all over the place had good ideas; some coalesced and succeeded, others didn't. Some were adapted to work together in a piecemeal fashion, such as Basic Auth.

And that's totally normal. Organic growth and distributed development are the baseline. Yes, the Web as we know it today is the result of many competing voices, but at the same time it could only work because everyone ended up agreeing on the basics.

The fact of the matter is that some companies - looking at you, FAANG - would rather have us all locked into closed, black-box ecosystems than have open standards around that allow for interoperability and thus create opportunities for new threats to challenge their business interests.

I understand that the article is written by CloudFlare, a CDN company with its own interests. But I'm trying to wrap my head around how the author failed to address exactly these future opportunities and threats, after this entire exposé.


> I understand that the article is written by CloudFlare, a CDN company with its own interests.

Not sure what you mean by that, but Zack wasn't writing something to further some secret interest Cloudflare has in the structure of URLs.


It's always prudent to be circumspect about the motives of writers of what is essentially marketing content.


I think the operative phrase is 'a CDN company'.

URLs are names for things (companies, mailboxes, pictures of cats), but they're also (encoded) directions to get representations of those named things.

CloudFlare is concerned with the mechanics mostly, the latter. Things like Wikidata, knowledge bases, schema.org are interested in the former perspective.


Cloudflare has Edge Workers and lots of other technologies that would be helpful to the semantic web.

Anything that is URL addressable is great for Cloudflare.


> The author completely falls short of describing the evolution of the SemWeb over the past 10 years. Tons of specs, declarative languages, and technologies have grown up not just to get beyond the verbosity of a serialization format such as XML, but also to move away from the classic relational data model.

> Turtle, JSON-LD, SPARQL, Neo4J, Linked Data Fragments,... come to mind. And then there are the emerging applications of linked data. If anything, the Federated Web is exactly about URLs and semantic web technologies based on linking and contextualizing data.

I think that the author addressed this very well:

> There is a popular perception that the internet standards bodies didn’t do much from the finalization of HTTP 1.1 and HTML 4.01 in 2002 to when HTML 5 really got on track. This period is also known (only by me) as the Dark Age of XHTML. The truth is though, the standardization folks were fantastically busy. They were just doing things which ultimately didn’t prove all that valuable.

> One such effort was the Semantic Web.

Most of the things you listed were developed in that period. I'll make a partial exception for JSON-LD because - as the author of that standard himself says:

> So screw it, we thought, let’s create a graph data model that looks and feels like JSON, RDF and the Semantic Web be damned.

and

> I hate the narrative of the Semantic Web because the focus has been on the wrong set of things for a long time.

[1] http://manu.sporny.org/2014/json-ld-origins-2/


It's fair to say that there's a vast difference between assessing the usefulness of the technical output in this day and age on the one hand, and looking at past context - the decision processes, incumbents, power dynamics,... - in which that output was established.

> There is a popular perception that the internet standards bodies didn’t do much from the finalization of HTTP 1.1 and HTML 4.01 in 2002 to when HTML 5 really got on track. This period is also known (only by me) as the Dark Age of XHTML.

I think that's hindsight bias talking.

Who knew at the time how the next 20 years would play out? Google was just in its infancy. Internet Explorer dominated the browser market, and the same concerns - vendor lock-in and proprietary protocols - were just as much a thing back then as they are today.

HTML5 could emerge because of the wide adoption of XHTML and web standards by developers and designers. Not despite the existence of XHTML. The latter is just heavily colored value attribution on the part of the author.

> The truth is though, the standardization folks were fantastically busy. They were just doing things which ultimately didn’t prove all that valuable.

I think this applies to literally any sizable enterprise, as rising complexity diminishes predictability. The only way to find out whether or not a complex enterprise is valuable is... by going down that road and testing your ideas.

The implication made here is a take against standards bodies for not following market dynamics - doing market research, following dominant technologies - and instead imposing their own principled vision on a market.

But that's a false dichotomy. If anything, standards bodies are committees made up in part of people who are affiliated with or represent incumbents in the marketplace, and in part of people who represent interest groups outside of commercial ventures, such as academia, research, public governance, and so on.

The output of a standards body is by definition a compromise that isn't tailored to the specific needs and wants of a single actor. That's actually a good thing.

> I hate the narrative of the Semantic Web because the focus has been on the wrong set of things for a long time.

The author is correct. The RDF spec has a lot of shortcomings. And the Semantic Web discussion was a difficult debate for a long time because things hadn't coalesced into a clear vision. And that's not a bad thing.

Context matters. At that time, nobody knew what the SemWeb was supposed to become or into what it would evolve. It was simply an idea, and there were a few tentative attempts to work in a problem space that wasn't fully charted yet. It's hard to navigate if you don't know the lay of the land, right?

This blog post was written when the final recommendation of JSON-LD was published. And that specification could only emerge after it was clear that the debate wasn't leading anywhere.

All I see is a normal evolution of things in an R&D context. By the same token, you could argue that the telegraph was a useless device because usage declined and nobody is using that technology anymore. But then you'd disregard the fact that the existence and use of telegraphs inspired others to create improvements such as the telephone or the radio.


> All I see is a normal evolution of things in an R&D context.

Which is fine, except the premature standardization approach used by Semantic Web technologies destroyed any chance they had of working.

> HTML5 could emerge because of the wide adoption of XHTML and web standards by developers and designers. Not despite the existence of XHTML. The latter is just heavily colored value attribution on the part of the author.

Actually, no. HTML5 forked from HTML 4, not XHTML, because the W3C had a different vision.

This isn't just the author's view: I followed the mailing list and it's pretty well understood.

See, for example, the "A Competing Vision" section in https://diveinto.html5doctor.com/past.html


Thanks for reading and for your careful analysis. My perspective lives in the last paragraph of the post and I'll let that stand.


>> They didn’t attempt to answer that question in any satisfying manner. Instead they focused on how and when we can use 303 redirects to point users from links which aren’t documents to ones which are, and when we can use URL fragments (the bit after the ‘#’) to point users to linked data.

> Err. They did.

> That's what the Resource Description Framework is all about. It gives you a few foundational building blocks for describing the world. Even more so, URI's have absolutely NOTHING to do with HTTP status codes. It just so happens that HTTP leverages URI's and creates a subset called HTTP URL's that allows the identification and dereference of webbased resources.

> You can use URI's as globally unique identifiers in a database. You could use URN's to identify books. For instance urn:isbn:0451450523 is an identifier for the 1968 novel The Last Unicorn.

> So, this is a false claim. I could forgive them for inadvertently not looking beyond URL's as a mechanism used within the context of HTTP communication.

See, this is almost the canonical example of why the semantic web remains the once and always future of the web.

Take this: "Even more so, URIs have absolutely NOTHING to do with HTTP status codes. It just so happens that HTTP leverages URIs and creates a subset called HTTP URLs that allows the identification and dereferencing of web-based resources."

Sure, URIs are just the addressing scheme. I think we all get that. But the practicalities of building systems mean that applications have to understand both the addressing scheme and some way of handling errors, which status codes supply. Notably, all implementations of URI schemes (HTTP, files, IPFS) have to implement error handling themselves.

The holistic approach that the (non-semantic) web took in evolving the browser, HTML, and HTTP together meant that practical applications could be built on it.

Contrast that with the ideological approach of the semantic web, where - yes - the Resource Description Framework (RDF) gives you addresses, but it's a weak data modelling approach that would be ignored if it appeared in a programming language (e.g. the lack of list support! - see [1]).

Anyway, to go to your original point: the original httpRange-14 was in the context of HTTP URIs, but the issue equally applies to non-HTTP URIs. At least for HTTP we can discuss it sensibly because status codes are part of the spec. For URIs in a general sense it seems impossible to resolve this (no pun intended).

[1] See Decision 3 in http://manu.sporny.org/2014/json-ld-origins-2/ (or read the whole article. It's good).



