
The History of the URL: Domain, Protocol, and Port - zackbloom
https://eager.io/blog/the-history-of-the-url-domain-and-protocol/
======
gumby
This is a good article. A few nits:

It was _the_ ARPANET (or _the_ arpanet since most systems were case-
insensitive in those days - Multics, and later Unix, were exceptions, not the
rule) as in ARPA's network, that used arpanet protocols like NCP. You do use
"the" the first time in your article but seem to have dropped it after that.

CHAOSNET was just a LAN protocol like "ethernet" or pup -- also used what we
call 10base2 "thicknet" coax. It was developed at MIT's AI Lab and was pretty
much used only there and at a few institutions close to MIT like Symbolics and
LMI.

In the NCP days routing was handled by IMPs (Interface Message Processors)
which were not PDP-11s, and when '11s were used they were smaller than the
11/70s which you used to illustrate the article (11/70s were the largest
PDP-11s made -- still 16 bit unlike the 36-bit PDP-10s which were the mainstay
of academic computer science in those days).

> In this era before ‘mail servers’, if my computer was off you weren’t
> sending me an email.

In that era few people had what you would consider a personal computer and
more likely you logged into a timesharing system that had your mail along with
everything else. So your statement is true, yet an anachronism. Even if you
did have your own host, the upstream host (one earlier in the ! path) would
have your message so you could consider it literally your mail server.

------
wahern
DNS was never ASCII only, and I've never seen DNS software make that
assumption--that "every piece of internet hardware from the last forty years,
including the Cisco and Juniper routers used to deliver this page to you
[assumes ASCII]".

The essay links to RFC 1035 to support its claim of ASCII only, but what RFC
1035 actually says is

"However, future additions beyond current usage may need to use the full
binary octet capabilities in names, so attempts to store domain names in 7-bit
ASCII or use of special bytes to terminate labels, etc., should be avoided."

and

"Although labels can contain any 8 bit values in octets that make up a label,
it is strongly recommended that labels follow the preferred syntax described
elsewhere in this memo, which is compatible with existing host naming
conventions."

Indeed, some country TLD servers were (and maybe still are) supporting non-
punycoded UTF-8 directly.
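
To make the "any 8 bit values" point concrete, here is a minimal Python sketch of the length-prefixed label encoding that RFC 1035 describes. The helper name `encode_name` is hypothetical, and feeding it raw UTF-8 is an assumption for illustration rather than anything the RFC recommends.

```python
def encode_name(name: bytes) -> bytes:
    """Encode a dotted name into DNS wire format: each label is preceded by
    a one-octet length, and the name ends with a zero-length root label."""
    out = bytearray()
    for label in name.rstrip(b".").split(b"."):
        if not 1 <= len(label) <= 63:
            raise ValueError("each label must be 1 to 63 octets long")
        out.append(len(label))
        out += label  # copied verbatim: no case folding, no 7-bit masking
    out.append(0)
    if len(out) > 255:
        raise ValueError("encoded name must not exceed 255 octets")
    return bytes(out)


# "b\u00fccher" is "bucher" with a u-umlaut; its UTF-8 form has one 2-byte character.
print(encode_name(b"GoOgLe.com"))
print(encode_name("b\u00fccher.example".encode("utf-8")))
```

Nothing in the framing depends on the octets being ASCII; the length prefix is what delimits a label, not a terminator byte.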

Lookups are supposed to be case-insensitive, but it's always been verboten to
actually modify the case of names in a DNS packet. A query reply is supposed
to include the identical question name in an 8-bit clean manner. Indeed, some
DNS clients will arbitrarily randomize the case of names to add an element of
randomness to thwart DNS spoofing attacks. (If the answer isn't the same 8-bit
name, you ignore it just as if it came from a different IP address than you
sent it to.) Unfortunately there exist enough broken DNS proxies out there that
software like Firefox or Chrome can't do this without headaches, but I've
never encountered such broken software myself (at least, not that I knew
about). At worst I've seen query responses which lack the question portion
altogether, and this can cause timeouts (rather than immediate failures) for
software which enables anti-spoofing measures. But I've also seen responses
which lack the same QID, too. There's always broken software; the threshold
for when you can ignore it is highly context dependent.
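
A rough Python sketch of the case-randomization idea, for anyone who hasn't seen it. The function names are made up for illustration, and a real client would do this on the wire-format question rather than on strings.

```python
import secrets


def randomize_case(name: str) -> str:
    """Randomly pick upper or lower case for every ASCII letter in a name."""
    out = []
    for ch in name:
        if ch.isascii() and ch.isalpha():
            ch = ch.upper() if secrets.randbits(1) else ch.lower()
        out.append(ch)
    return "".join(out)


def reply_is_acceptable(sent_qname: str, echoed_qname: str) -> bool:
    """Accept a reply only if it echoes the question name exactly as sent."""
    return sent_qname == echoed_qname


query = randomize_case("example.com")
print(query)  # some random mix of upper and lower case
```

Each letter contributes one bit an off-path attacker has to guess, which is why the check only works if the server echoes the name without touching its case.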

~~~
bluejekyll
> some DNS clients will arbitrarily randomize the case of names to add an
> element of randomness to thwart DNS spoofing attacks.

I believe this is undefined behavior. It shouldn't be something you count on.
The only reference I found in the spec that implies this is:

 _The question section of the response matches the question section of the
query_

From RFC 1034, which isn't very specific, but could be interpreted by some in
the way you mean.

If you want to secure the request, it's best to randomize the QID and outbound
port. If a server responds with the wrong QID, I'd ignore it.
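
As a sketch of what that looks like in Python (helper names are hypothetical; how unpredictable the source port is depends on the operating system, which is an assumption here):

```python
import secrets
import socket
import struct


def new_query_id() -> int:
    """Pick an unpredictable 16-bit query ID."""
    return secrets.randbelow(1 << 16)


def build_header(qid: int) -> bytes:
    """Pack a 12-byte DNS header: ID, flags (RD set), QDCOUNT=1, other counts 0."""
    return struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)


def reply_id_matches(qid: int, reply: bytes) -> bool:
    """Compare the first two bytes of a reply against the ID we sent."""
    return len(reply) >= 2 and struct.unpack("!H", reply[:2])[0] == qid


def open_query_socket() -> socket.socket:
    """Open a UDP socket on an OS-chosen ephemeral port (port 0 means 'pick one')."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 0))
    return sock
```

A reply whose ID fails `reply_id_matches` would simply be dropped, the same way a reply from an unexpected address would be.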

~~~
wahern
From RFC 1034 S. 3.1: "[D]omain name comparisons for all present domain
functions are done in a case-insensitive manner, assuming an ASCII character
set, and a high order zero bit. When you receive a domain name or label, you
should preserve its case. The rationale for this choice is that we may someday
need to add full binary domain names for new services; existing services would
not be changed."

First, we can't speak of it being undefined in the same manner as we do
undefined in the C standard. The DNS standards weren't this rigorous, and
didn't use the consistent terminology, like MUST and SHOULD, that is universal
in today's RFCs.

Second, they were explicit that while the existing services (e.g. IN class and
A record type) were ASCII-based and case-insensitive, the binary protocol was
meant to be 8-bit clean, that some labels might be 8-bit in the future, and it
was expected and mandated that this capability be preserved. So strictly
speaking, the RFC allowed a server to, e.g., modify the case of an A record
label on the wire, but not of some unknown label. In practice it's easier to
simply treat all labels in an 8-bit clean manner, and that's in fact what
major implementations do. You literally have to go out of your way to do
otherwise while still obeying the standard.
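
To illustrate "compare case-insensitively, but preserve what you received", a minimal Python sketch (the function name is hypothetical):

```python
def labels_equal(a: bytes, b: bytes) -> bool:
    """Compare two labels while folding only ASCII letters.

    bytes.lower() maps only ASCII A-Z to a-z, so octets with the high bit
    set are compared exactly, which keeps the comparison 8-bit clean.
    """
    return a.lower() == b.lower()


stored = b"GoOgLe"  # kept exactly as it arrived
print(labels_equal(stored, b"google"))         # True: ASCII case is ignored
print(labels_equal(b"\xc3\x89", b"\xc3\xa9"))  # False: non-ASCII octets differ
print(stored)                                  # still b'GoOgLe'
```

The folding happens only inside the comparison; the stored label is never rewritten, so it can be echoed back unchanged.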

Caching name servers like BIND and unbound will reply with the identical
question label. For example, notice in the following how the TTL is
decremented (and thus being pulled from cache) but the query case is
preserved:

    
    
      % dig -t A google.com                               
      ; <<>> DiG 9.8.3-P1 <<>> -t A google.com
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20838
      ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
      
      ;; QUESTION SECTION:
      ;google.com.			IN	A
      
      ;; ANSWER SECTION:
      google.com.		105	IN	A	172.217.4.206
      
      ;; Query time: 0 msec
      ;; SERVER: 192.168.2.1#53(192.168.2.1)
      ;; WHEN: Sat Jul  9 00:45:57 2016
      ;; MSG SIZE  rcvd: 44
    
      $ dig -t A GoOgLe.com                               
      ; <<>> DiG 9.8.3-P1 <<>> -t A GoOgLe.com
      ;; global options: +cmd
      ;; Got answer:
      ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 7947
      ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0
      
      ;; QUESTION SECTION:
      ;GoOgLe.com.			IN	A
      
      ;; ANSWER SECTION:
      GoOgLe.com.		95	IN	A	172.217.4.206
      
      ;; Query time: 0 msec
      ;; SERVER: 192.168.2.1#53(192.168.2.1)
      ;; WHEN: Sat Jul  9 00:46:07 2016
      ;; MSG SIZE  rcvd: 44
    
    

In reality, the core DNS infrastructure was perfectly capable of fully
supporting raw UTF-8 labels (though a DJB page suggests that some older
versions of Unix gethostbyname stripped 8-bit labels). Unlike other
infrastructure, the implementations were fairly homogeneous (until a few years
ago BIND absolutely dominated), so ad hoc (and broken) implementations were
few and far between. And unlike other infrastructure, there was very little
incentive to violate 8-bit cleanliness. The biggest problems were not that
some ad hoc implementations modified case, per se, but that some ad hoc
caching proxies would reply with the case of a cached record. That's out of
sheer laziness, or because they didn't read the standard closely enough. It's
telling that BIND, unbound, and other major caching proxies are careful to
preserve case in the reply even though that's not necessarily the easiest
solution.

The real problem was edge software, like browsers, e-mail clients, etc, that
baked in way more assumptions than warranted. Arguably IDNA and punycode took
more effort to roll out than would have alternatives based on raw UTF-8. The
core infrastructure software wasn't a real barrier, and the IDNA solution
required more code at the edges. While the major browsers were facing lots of
work regardless, most ad hoc software would have been fine just fixing 8-bit
cleanliness problems and then punting on things like glyph security issues,
especially if they weren't directly user facing. The vast majority of edge
software would have just required some slight refactoring, not huge rewrites
with library dependencies for the new compression scheme, etc.
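
For a feel of the difference at the edge, here is a small Python comparison of the two representations of one label. It uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules; the label itself is just an example.

```python
label = "b\u00fccher"  # "bucher" with a u-umlaut

raw_utf8 = label.encode("utf-8")  # what an 8-bit-clean path could carry as-is
ace_form = label.encode("idna")   # the ASCII-compatible form IDNA sends instead

print(raw_utf8)
print(ace_form)
print(ace_form.decode("idna") == label)
```

The raw form needs nothing beyond not mangling bytes, while the ASCII-compatible form needs the encoder and decoder at every edge that wants to show the name to a user.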

------
rconti
> The first 32 identified the remote host, similar to how an IP address works
> today. The last eight were known as the AEN (it stood for “Another Eight-bit
> Number”), and were used by the remote machine in the way we use a port
> number

Gold.

Great read, it hits home for me with the right mix of nostalgia, history from
before my time, and funny little things I never knew.

------
b15h0p
Another nitpick: on iOS Safari, that pizza-poo-domain name actually does show
up in the address bar. So there has to be another mechanism that prevents the
Amazon-with-Cyrillic-"a"-trick which I guess involves normalization.

------
gregrata
Great read! "Thanks" for all the Wikipedia links - I ended up wasting a few
hours reading more details.

------
echeese
For http:com/example/foo/bar/baz how would you determine what the host is?

~~~
wtbob
It doesn't actually matter: in a world which used that sort of addressing, one
could imagine saying to com 'give me HTTP info for your example/foo/bar/baz,'
to com/example 'give me HTTP info for your foo/bar/baz' and so forth; in that
case, com would just say, 'hey, go talk to 266.328.0.1 (that's what I call
example)' and 266.328.0.1 would cheerfully return the information stored at
the filesystem path /foo/bar/baz, or it could say, 'hey, I call foo
463.622.42.17' and your browser would keep resolving.
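
A toy Python sketch of that walk, assuming a made-up delegation table (all server names and contents here are hypothetical):

```python
from typing import Tuple

# Which server handles each top-level name.
ROOT = {"com": "com-server"}

# Each server either delegates the next path component or serves files itself.
NODES = {
    "com-server": {"delegations": {"example": "example-server"}, "files": {}},
    "example-server": {"delegations": {}, "files": {"foo/bar/baz": "contents of baz"}},
}


def resolve(path: str) -> Tuple[str, str]:
    """Return (server that answered, resource) for a com/example/... path."""
    first, *rest = path.split("/")
    server = ROOT[first]
    while rest and rest[0] in NODES[server]["delegations"]:
        # Referral: "go talk to this other server, that's what I call <name>".
        server = NODES[server]["delegations"][rest[0]]
        rest = rest[1:]
    # No further delegation: the remaining components are a local path.
    return server, NODES[server]["files"]["/".join(rest)]


print(resolve("com/example/foo/bar/baz"))
```

The client never needs to know in advance where the host ends and the path begins; the boundary is wherever the referrals stop.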

Me, I kinda wish we wrote URLs as
[http://com.example.host.invalid/path/to/resource](http://com.example.host.invalid/path/to/resource).

~~~
gumby
> Me, I kinda wish we wrote URLs as
> [http://com.example.host.invalid/path/to/resource](http://com.example.host.invalid/path/to/resource).

The UK's predecessor to the DNS worked this way (a "big endian" hierarchy).
Sorry I can't remember the network name; if I remember it was rooted in gb.
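
Converting between the two orderings is just a matter of reversing the labels. A minimal sketch, with a hypothetical helper name and made-up example names:

```python
def flip_label_order(name: str) -> str:
    """Reverse the dot-separated labels of a name."""
    return ".".join(reversed(name.split(".")))


print(flip_label_order("www.example.com"))  # com.example.www
print(flip_label_order("uk.ac.example"))    # example.ac.uk
```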

~~~
Theodores
I am old enough to remember the final year of JANET with the backwards
addresses.

At the time I was at Plymouth Marine Laboratory with a university email
address of the type researcher@uk.ac.pml - i.e. backwards. However, in those
days there were many different kinds of network, and you could have several
connector types in a room, so anything beyond email was a bit like the
difference between travelling across state borders in the U.S. and travelling
across the Iron Curtain. I can't remember how one got from one's VT terminal
to the wider internet on VAX/VMS but that was possible. FTP and some Telnet
was how it worked, none of this www stuff.

The change of address structure to normal internet style was not that big of a
change. You would think it would have been as traumatic as changing what side
of the road to drive on or the Millennium Bug, but the change happened with
no huge amount of work needed or resultant disruption.

~~~
rjsw
I am old enough to have been a JANET site administrator.

The JANET when I used it ran over a private X.25 network with a few gateways
to BT's public X.25 network. There was a gateway to the internet at University
of London Computer Centre but it only provided an FTP client.

------
userlabs
very good info thanks

------
ChristianBundy
> This restriction on HTML was ultimately removed in 2007 and that same year
> Unicode became the most popular encoding on the web.

Nitpick: Unicode is a character set, UTF-8 is an encoding.
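
A quick Python illustration of the distinction: one Unicode code point, several possible byte encodings.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(ch)             # the character's number in the Unicode set
utf8 = ch.encode("utf-8")        # two bytes
utf16 = ch.encode("utf-16-le")   # two different bytes
latin1 = ch.encode("latin-1")    # a single byte

print(hex(code_point), utf8, utf16, latin1)
```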

~~~
zackbloom
Should be fixed shortly, thanks!

------
bluejekyll
Nitpick: the modern internet is built on IP, not TCP/IP. It would be fair to
say most protocols use TCP today, but definitely not all.

------
RickHull
Great article. Another nitpick:

> _It’s important to dispel any illusion that these decisions were made with
> precence for the future the domain name would have._

I don't think _precence_ is a word, and I'm not sure what would make sense as
its replacement.

~~~
juliendorra
Prescience (knowledge of things before they happen)

~~~
schoen
Now I wonder if "foreknowledge" could have come into English as a calque of
Latin "praescientia".

