
BSD Sockets Addressing - beefhash
https://idea.popcount.org/2019-12-06-addressing/
======
andoma
Somewhat related, IPv4 addresses can also be represented in a shorter
classful NET.HOST form, like

    
    
      10.1 == 10.0.0.1
      192.168.1 == 192.168.0.1
    

While it has practically no use in today's CIDR [1] internet, it's still a
useful shortcut for home networks, etc. (I have my home network configured as
10.0/24.)

    
    
      $ ping 10.1
      PING 10.1 (10.0.0.1) 56(84) bytes of data.
      64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.181 ms
    

I've also recently stumbled on some software that no longer parses this
address form. Can't remember where though.

[1] [https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing](https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing)

~~~
fanf2
1980s style IPv4 address syntax has been deprecated for 20 years:

[https://tools.ietf.org/html/rfc2553#page-31](https://tools.ietf.org/html/rfc2553#page-31)

[https://pubs.opengroup.org/onlinepubs/9699919799.2018edition...](https://pubs.opengroup.org/onlinepubs/9699919799.2018edition/functions/inet_pton.html)

Anything that continues to support the ancient syntax is in serious need of
fixing with extreme prejudice.

~~~
meddlepal
Why does it need to be "fixed" if it works fine? Just because it is
deprecated doesn't mean it is broken.

~~~
orf
It means it will be broken eventually, and things are usually deprecated
because they _are_ broken in some way.

------
majke
Author here. This article is part of a series.

(1) Creating sockets

(2) Addressing

Soon to come:

(3) AF_INET6 archeology: scope id and flowinfo

(4) bind/connect

I'm trying to build an audience, so make sure to follow me on twitter
[https://twitter.com/majek04](https://twitter.com/majek04)

(I don't like harvesting emails, so opted for twitter)

~~~
beefhash
I've read your post about a brief history of select(2) and have tried to do
some archaeology of my own into how the BSD sockets API came to be the way it
is (struct sockaddr before struct sockaddr_storage, connect and bind being
separate calls, read/write v. recv/send), but it's been very hard to
get complete documentation on the evolution since the first BBN drafts.

Food for extra thought I suppose; there's probably a lot of knowledge and
trivia to be gleaned from the historical evolution of the interfaces, and I'd
imagine you're in a better position to actually sink your fangs into this.

Incidentally, your post is time travelling and claims to be from Dec 6, 2019.

------
schmichael
A major gotcha for using file-based UNIX sockets for communicating with
containers is that UNIX sockets have far shorter path length limits than
anything else. This bit me when I implemented Nomad's Consul Connect
integration and used a UNIX socket to allow Envoy in a container to contact
Consul on the host:

[https://github.com/hashicorp/nomad/issues/6516](https://github.com/hashicorp/nomad/issues/6516)

~~~
quotemstr
It'd be easy enough to fix in the kernel. The challenge would be overcoming
the chorus of "no". It's hard to add fixes even for clear and long-standing
problems when you have to respond to "do we REALLY need this change?" with an
overwhelming amount of evidence.

~~~
duncaen
The size of `sockaddr_storage` is not defined by POSIX, but `sockaddr_un` is
defined, and you can't just change `sun_path` to a pointer, so to increase
`sun_path` you would have to increase the `sockaddr_storage` struct size. This
comes with other downsides. The first is incompatibility with other OSes: most
OSes seem to be around 104-109 bytes. The second is that with any larger
value, each socket call would have to copy more data, even for sockets that
are not unix sockets. If you change it to something that looks sensible like
`PATH_MAX`, you end up increasing memory requirements for any application that
works with sockets.

~~~
quotemstr
The relevant functions take size parameters. We don't have to limit ourselves
to sockaddr_storage.

------
quotemstr
It's always bugged me that AF_UNIX addresses have a relatively short length
limit. If you look at the actual kernel ABI, you'll see that the
`sockaddr_un` structure could be made as long as necessary. The kernel could
support long unix socket addresses without new APIs.

I also don't like abstract sockets at all. They're vulnerable to squatting
attacks. The nice thing about non-abstract unix sockets is that they take
advantage of the filesystem permissions that everyone already understands.
When possible, I prefer putting things into a single namespace.

Unix sockets (of the non-abstract sort) also need a way to atomically unlink
an existing socket and bind in its place. Doing that as two operations is
pretty common, but it's racy.

~~~
jlokier
> Unix sockets (of the non-abstract sort) also need a way to atomically unlink
> an existing socket and bind in its place. Doing that as two operations is
> pretty common, but it's racy.

To atomically unlink an existing socket and bind another in place, without
races, you should be able to first bind a new socket to a temporary name in
the same directory, and then use rename(2) to atomically replace the existing
name.

There may be queued connections that need accept(2) in the old socket after
the rename, but that's not a race condition.

~~~
quotemstr
My complaint wasn't detailed enough. What I really want is a way to atomically
rebind a socket if nobody else is listening on that socket --- the stale
process cleanup case. I _want_ the bind to fail if the socket is live. Does
that make sense? Haven't had enough coffee yet today.

~~~
jlokier
It makes sense, but I don't think it actually gains you anything, because of a
subtlety around races.

If you want the rebind to fail only when the socket is bound by another
process, it may be that the other process is _about_ to close the socket, and
your atomic-rebind would result in the socket not being bound by either
process.

That outcome is identical to the outcome from the available method, where you
attempt connect() first and only bind-and-rename if the connect() fails with
an appropriate error. When there's no race, you always get the desired outcome
of the socket bound by exactly one process, but if the processes are racing,
the result can be the socket not bound by either process.

Because both methods produce the same outcome in the race case, and the only
difference is an _unobservable_ difference in each process' logical clocks
(you can't tell which events really happened first), both methods actually
have the same race properties.

If you want to _always_ end up with exactly one server running, and to
_always_ avoid taking away the socket from an existing server which is running
just fine, I think you need to involve some protocol. You can try connect()
and then if it succeeds, send a request "are you shutting down?". An
appropriate type of connect() error or "yes" means it's safe to bind a new
socket and rename over, otherwise leave the socket alone.

------
derefr
I’ve always found it annoying that we have these multiple types of socket
addressing, which were all invented before there was any concept of a URL, and
yet there’s no clear bijective mapping between arbitrary socket address
formats, and URL origins. I.e., you can’t just create a URL where the origin
part is specified as a Unix domain socket path. (Or any other kind of format
besides AF_INET and AF_INET6, despite there having been many other weird
historical ones.)

Which means everyone who needs their client to support connecting to an RPC
server over both AF_INET/6 sockets and AF_UNIX sockets needs to invent some
custom metaformat on top of URLs for specifying that they want a connection
through a Unix socket path. Kind of ridiculous, really.

~~~
fulafel
I'm not sure there would be any difficulty supporting Unix socket paths in
your URI scheme? I guess you can just steal the semantics from the file://
scheme.

Or do you want the same URI scheme to support all your different transport
protocols? It kind of feels like missing the point of URIs, after all the
whole idea is that your application can interchangeably handle different kinds
of URIs for different use cases.

~~~
derefr
Yes, in my ideal world, URIs would be a sequence of "steps" to resolving a
resource, where each step starts off with a nominal "resolver object" it's
working against, and then names a scheme (method) available upon that resolver
object, where the rest of that step of the URL are then method parameters to
that abstract call, resulting in either another "resolver object", or a fully-
resolved resource. The scheme of the final step then determines the protocol
for the interaction with the resource.

This is somewhat, but not perfectly, analogous to how 9P mounts work in Plan9.
In both models, you start off with your local namespace, which has servers
mounted on it like /mnt/web or /mnt/ftp. These are the "schemes" available on
the initial resolver object. You then build a path to a virtual file within
one of these mountpoints (/net/tcp, say), where that path turns out to point
to a resource that describes (in a format another 9P server understands) how
to create another mount representing that connection.

The difference between the two models is that, with 9P, you'd have to take
this descriptor file and mount it yourself; whereas in my model, you wouldn't
see a descriptor file, but rather another already-mounted server, with its own
schemes as its root-level exposed virtual directories.

Pretending 9P worked this way, you'd be able to have a namespace path that
looks like, e.g.:

    
    
        /net/tcp/foo.com/80/http/patha/pathb/a=b/c=d
    

Back in URL land, this kind of path would correspond to a "meta-URL" that
looks like e.g.:

    
    
        tcp:foo.com:80#http:/patha/pathb?a=b&c=d
    

where the "#" symbol now takes on new significance as a command for the URL
resolver, to address the next part of the path to the resolver-object
discovered by resolving everything on the meta-URL so far. (Which perfectly
subsumes #'s current use, since browsers interpret a # as identifying a
particular DOM-tree node within a document—i.e. as using the resource acquired
from the previous URL-step as a resolver to find the next object.)

You could do all sorts of crazy things with this, too, like creating URLs that
represent files available within archives that exist as HTTP streams, e.g.:

    
    
        tcp:/foo.com/80#http:/foo/bar.tar#file:/index.html
    

Or adding DNS or TLS arbitrarily into protocol stacks that weren't designed
for them:

    
    
        ip:/dns/foo.com#tcp#tls#irc:%23room

