BSD Sockets Addressing (popcount.org)



Somewhat related, IPv4 addresses can also be represented in a shorter classful form NET.HOST, like

  10.1 == 10.0.0.1
  192.168.1 == 192.168.0.1
While it has practically no use in today's CIDR [1] internet, it's still a useful shortcut for home networks, etc. (I have my home network configured as 10.0/24):

  $ ping 10.1
  PING 10.1 (10.0.0.1) 56(84) bytes of data.
  64 bytes from 10.0.0.1: icmp_seq=1 ttl=64 time=0.181 ms
I've also recently stumbled on some software that no longer parses this address form. Can't remember where though.
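
For reference, a rough C sketch (mine, untested; assumes a libc whose inet_aton(3) still accepts the classful shorthand):

    /* Demonstrates that inet_aton(3) parses the short classful forms. */
    #include <stdio.h>
    #include <arpa/inet.h>

    int main(void) {
        struct in_addr addr;

        /* "10.1" is NET.HOST: an 8-bit net part and a 24-bit host part. */
        if (inet_aton("10.1", &addr))
            printf("10.1      -> %s\n", inet_ntoa(addr));  /* 10.0.0.1 */

        /* "192.168.1" leaves a 16-bit host part. */
        if (inet_aton("192.168.1", &addr))
            printf("192.168.1 -> %s\n", inet_ntoa(addr));  /* 192.168.0.1 */

        return 0;
    }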

[1] https://en.wikipedia.org/wiki/Classless_Inter-Domain_Routing


1980s-style IPv4 address syntax has been deprecated for 20 years:

https://tools.ietf.org/html/rfc2553#page-31

https://pubs.opengroup.org/onlinepubs/9699919799.2018edition...

Anything that continues to support the ancient syntax is in serious need of fixing with extreme prejudice.


Why does it need to be "fixed" if it works fine? Just because it is deprecated doesn't mean it is broken.


It means it will eventually break, and things usually get deprecated because they are broken in some way.


In a similar vein, '0' is INADDR_ANY, which in connect() usually means the local machine (except Windows?).

So for webapp development, http://0:3333 is about as convenient a URL as you can get, except it doesn't work in Chrome
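
A rough sketch of what happens under the hood (mine, untested; assumes Linux and something already listening on local port 3333):

    /* connect() to 0.0.0.0 (INADDR_ANY): on Linux this reaches the
     * local machine, which is why http://0:3333 works at all. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in sin;

        memset(&sin, 0, sizeof(sin));
        sin.sin_family = AF_INET;
        sin.sin_port = htons(3333);
        sin.sin_addr.s_addr = htonl(INADDR_ANY);  /* "0" == 0.0.0.0 */

        if (connect(fd, (struct sockaddr *)&sin, sizeof(sin)) == 0)
            puts("connected to 0.0.0.0 -> local machine");
        else
            perror("connect");
        close(fd);
        return 0;
    }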


Doesn't the "doesn't work in [some browser]" make it... not convenient?

Is there a problem with localhost:3333 (or whatever)?


It's eight characters longer?


Author here. This article is part of a series.

(1) Creating sockets

(2) Addressing

Soon to come:

(3) AF_INET6 archeology: scope id and flowinfo

(4) bind/connect

I'm trying to build an audience, so make sure to follow me on twitter https://twitter.com/majek04

(I don't like harvesting emails, so opted for twitter)


I've read your post about the brief history of select(2) and have tried to go down the same archaeological path, figuring out how the BSD sockets API came to be the way it is (struct sockaddr before struct sockaddr_storage, connect and bind being separate states, read/write vs. recv/send), but it's been very hard to find complete documentation on the evolution since the first BBN drafts.

Food for extra thought I suppose; there's probably a lot of knowledge and trivia to be gleaned from the historical evolution of the interfaces, and I'd imagine you're in a better position to actually sink your fangs into this.

Incidentally, your post is time travelling and claims to be from Dec 6, 2019.


A major gotcha for using file-based UNIX sockets for communicating with containers is that UNIX sockets have far shorter path length limits than anything else. This bit me when I implemented Nomad's Consul Connect integration and used a UNIX socket to allow Envoy in a container to contact Consul on the host:

https://github.com/hashicorp/nomad/issues/6516


It'd be easy enough to fix in the kernel. The challenge would be overcoming the chorus of "no". It's hard to add fixes even for clear and long-standing problems when you have to respond to "do we REALLY need this change?" with an overwhelming amount of evidence.


The size for `sockaddr_storage` is not defined by POSIX, but `sockaddr_un` is defined, and you can't just change `sun_path` to a pointer, so to increase `sun_path` you would have to increase the `sockaddr_storage` struct size. This comes with other downsides. The first is incompatibility with other OSes: most seem to allow around 104-109 bytes. The second is that with any larger value, each socket call would have to copy more data, even for sockets that are not unix sockets. If you change it to something that looks sensible like `PATH_MAX`, you end up increasing memory requirements for any application that works with sockets.
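
To make the numbers concrete, here's a little sketch (the printed values are platform-dependent; the comments show typical Linux values):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    int main(void) {
        printf("sockaddr_storage: %zu\n", sizeof(struct sockaddr_storage)); /* 128 */
        printf("sockaddr_un:      %zu\n", sizeof(struct sockaddr_un));      /* 110 */
        printf("sun_path:         %zu\n",
               sizeof(((struct sockaddr_un *)0)->sun_path));                /* 108 */
        return 0;
    }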


The relevant functions take size parameters. We don't have to limit ourselves to sockaddr_storage.


... or: The danger of defining things via C structs. (I'm being a teeny tiny bit facetious. It could theoretically be done reasonably via C structs, but the history of POSIX, etc. is littered with historical accidents that are causing weird and avoidable problems to this day.)


Very interesting, I didn't know that. On my macOS system, there are easily a dozen paths in /var that exceed 100 chars, for example because they have a UUID in them.


It's always bugged me that AF_UNIX addresses have a relatively short length limit. If you look at the actual kernel ABI, you'll see that the sockaddr_un structure can be made as long as necessary. The kernel could support long unix socket addresses without new APIs.

I also don't like abstract sockets at all. They're vulnerable to squatting attacks. The nice thing about non-abstract unix sockets is that they take advantage of the filesystem permissions that everyone already understands. When possible, I prefer putting things into a single namespace.

Unix sockets (of the non-abstract sort) also need a way to atomically unlink an existing socket and bind in its place. Doing that as two operations is pretty common, but it's racy.


> It's always bugged me that AF_UNIX addresses have a relatively short length limit. If you look at the actual kernel ABI, you'll see that the sockaddr_un structure can be made as long as necessary. The kernel could support long unix socket addresses without new APIs.

There is no limit except the typical file path limit. The kernel does support longer paths. It will obey the third, socklen_t, parameter to both connect(2) and bind(2). The size of .sun_path from the kernel's perspective extends to the end of the sockaddr structure declared by the socklen_t parameter.

This applies not only to Linux but all Unix-like systems except, IIRC, Minix.

The flip side is that you have to check the return values of accept(2), getsockname(2), and getpeername(2) for truncation. The full path of the socket may not fit in the size of the structure you pass.
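
A rough sketch of that truncation check (mine, untested; assumes the behavior where getsockname(2) reports the full address length even when it didn't fit in the buffer):

    #include <stddef.h>
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    /* Prints the bound path of fd, detecting truncation. */
    int print_bound_path(int fd) {
        struct sockaddr_un sun;
        socklen_t len = sizeof(sun);

        if (getsockname(fd, (struct sockaddr *)&sun, &len) < 0)
            return -1;
        if (len > sizeof(sun)) {
            /* The kernel had a longer address than our buffer holds. */
            fprintf(stderr, "path truncated (%u > %zu)\n",
                    (unsigned)len, sizeof(sun));
            return -1;
        }
        /* sun_path is not guaranteed to be NUL-terminated; bound it. */
        printf("bound to %.*s\n",
               (int)(len - offsetof(struct sockaddr_un, sun_path)),
               sun.sun_path);
        return 0;
    }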


> Unix sockets (of the non-abstract sort) also need a way to atomically unlink an existing socket and bind in its place. Doing that as two operations is pretty common, but it's racy.

To atomically unlink an existing socket and bind another in place, without races, you should be able to first bind a new socket to a temporary name in the same directory, and then use rename(2) to atomically replace the existing name.

There may be queued connections that need accept(2) in the old socket after the rename, but that's not a race condition.
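
Something like this rough, untested sketch (assumes Linux semantics, where the bound socket follows its inode across rename(2)):

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    /* Binds a new listening socket, then atomically replaces `path`. */
    int bind_replacing(const char *path) {
        struct sockaddr_un sun;
        char tmp[sizeof(sun.sun_path)];
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        snprintf(tmp, sizeof(tmp), "%s.%ld.tmp", path, (long)getpid());
        memset(&sun, 0, sizeof(sun));
        sun.sun_family = AF_UNIX;
        strncpy(sun.sun_path, tmp, sizeof(sun.sun_path) - 1);

        if (fd < 0 ||
            bind(fd, (struct sockaddr *)&sun, sizeof(sun)) < 0 ||
            listen(fd, 128) < 0 ||
            rename(tmp, path) < 0) {   /* the atomic replace */
            if (fd >= 0) close(fd);
            unlink(tmp);
            return -1;
        }
        return fd;
    }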


My complaint wasn't detailed enough. What I really want is a way to atomically rebind a socket if nobody else is listening on it --- the stale-process-cleanup case. I want the bind to fail if the socket is live. Does that make sense? Haven't had enough coffee yet today.


It makes sense but I don't think it actually gains you anything because of a subtlety around races.

If you want the rebind to fail only when the socket is bound by another process, it may be that the other process is about to close the socket, and your atomic rebind would result in the socket being bound by neither process.

That outcome is identical to the outcome from the available method, where you attempt connect() first and only bind-and-rename if the connect() fails with an appropriate error. When there's no race, you always get the desired outcome of the socket bound by exactly one process, but if the processes are racing, the result can be the socket not bound by either process.

Because both methods produce the same outcome in the race case, and the only difference is an unobservable difference in each process' logical clocks (you can't tell which events really happened first), both methods actually have the same race properties.

If you want to always end up with exactly one server running, and to always avoid taking away the socket from an existing server which is running just fine, I think you need to involve some protocol. You can try connect() and then if it succeeds, send a request "are you shutting down?". An appropriate type of connect() error or "yes" means it's safe to bind a new socket and rename over, otherwise leave the socket alone.
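
i.e. something like this untested sketch of the probe step (function name is mine):

    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    /* Returns 1 if something is accepting on `path`, 0 if it looks
     * stale (e.g. connect() fails with ECONNREFUSED). Still racy in
     * exactly the way described above. */
    int socket_is_live(const char *path) {
        struct sockaddr_un sun;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        int live;

        memset(&sun, 0, sizeof(sun));
        sun.sun_family = AF_UNIX;
        strncpy(sun.sun_path, path, sizeof(sun.sun_path) - 1);

        live = (connect(fd, (struct sockaddr *)&sun, sizeof(sun)) == 0);
        close(fd);
        return live;
    }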


> It's always bugged me that AF_UNIX addresses have a relatively short length limit.

I agree, it's not pretty. Some unixes have an extremely short limit, so short that you can't really use absolute paths portably unless they're somewhere like /tmp.

But there is a workaround.

The path doesn't have to be absolute, so if you chdir(2) first and use a relative name with bind() and connect(), you get a socket in any directory (see the sketch at the end of this comment).

(See also bindat() and connectat() in FreeBSD).

If you don't want to chdir(), you can make a temporary symbolic link in /tmp to your directory of choice for bind(), and directly to the socket for connect().

If the last path component is too long for bind(), renaming afterwards may work (or hard linking on the same filesystem then unlinking).

All the above assumes that AF_UNIX sockets are indexed by the inode on the filesystem, so different paths, rename etc work. I'm not sure if that is true on every old and weird kernel, if you're going for high portability.
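
The chdir() trick, as a rough untested sketch (error recovery and restoring the working directory omitted):

    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    /* Binds `name` inside `dir`; only `name` has to fit in sun_path. */
    int bind_in_dir(const char *dir, const char *name) {
        struct sockaddr_un sun;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);

        if (fd < 0 || chdir(dir) < 0) {   /* the long part goes here */
            if (fd >= 0) close(fd);
            return -1;
        }
        memset(&sun, 0, sizeof(sun));
        sun.sun_family = AF_UNIX;
        strncpy(sun.sun_path, name, sizeof(sun.sun_path) - 1); /* short, relative */

        if (bind(fd, (struct sockaddr *)&sun, sizeof(sun)) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }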


Abstract sockets sound, to me, kind of like an attempt to retrofit microkernel-like ports back onto UNIX domain sockets for the lack of a better mechanism. It might be worthwhile to consider giving up on that idea entirely and introduce a new IPC primitive (that must be compatible with D-Bus for it to find any kind of widespread adoption).


I don't see a need for a new primitive. What's wrong with tweaking Unix sockets?


You inherit all the weirdness from UNIX domain sockets, such as having the semantics of SOCK_STREAM, SOCK_DGRAM or SOCK_SEQPACKET (or finding out that your OS of choice doesn't support the one you wanted) depending on which one you picked.

Abstract sockets have no permissions. Not having to deal with the filesystem at all is nice, but having absolutely no security model whatsoever also seems dangerous.

You have the ugly issue of handling zeroes in your name, as noted in the article. This can and will break something, somewhere, on the first attempt at writing it.

When you want microkernel IPC, you usually tend to want a message-oriented reply-and-receive primitive to build your main loop around. Emulating this with sockets is error-prone since send(2) and recv(2) make extremely weak promises about their behavior on error. sendmsg(2) and recvmsg(2), which are necessary to pass kernel objects around (i.e. file descriptors), are very difficult to use (sketch at the end of this comment).

Sockets are nice because you can select(2) etc. on them, however. I'd expect a replacement would interoperate smoothly with at least those system calls.
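
To illustrate the "difficult to use" point, here's the canonical SCM_RIGHTS dance for sending one file descriptor, as a rough untested sketch:

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Sends fd_to_send over the unix socket `sock` as ancillary data. */
    int send_fd(int sock, int fd_to_send) {
        char data = 'x';  /* must send at least one byte of real data */
        struct iovec iov = { .iov_base = &data, .iov_len = 1 };
        union {           /* properly aligned ancillary buffer */
            struct cmsghdr hdr;
            char buf[CMSG_SPACE(sizeof(int))];
        } ctrl;
        struct msghdr msg;
        struct cmsghdr *cmsg;

        memset(&msg, 0, sizeof(msg));
        memset(&ctrl, 0, sizeof(ctrl));
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        msg.msg_control = ctrl.buf;
        msg.msg_controllen = sizeof(ctrl.buf);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }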


All of those communication modes are actually useful. Seqpacket works especially well. You can build whatever high level facilities you want on top of them: I have several times. Ease of use at the raw system call level is not relevant: application developers shouldn't be working at that level anyway.

If you want object IPC with a reply and receive operation, you can use Android's binder, which is already in mainline.


> you can use Android's binder, which is already in mainline

This sounds interesting, and as a regular Linux dev, I’ve never heard of it. Is there a tutorial for this that doesn’t assume you’re coming from an Android background?


> I also don't like abstract sockets at all. They're vulnerable to squatting attacks.

TCP and UDP port numbers have the same problem. Anybody can bind any port number >= 1024, which blocks another server from binding that port.


Yes. And now that Windows supports AF_UNIX sockets, nobody has an excuse for binding random ports on localhost for IPC.


Windows always had named pipes, which are similar enough to AF_UNIX that you could use them instead.


Unless you want to support older Windows. I guess the EOL date for Win7 is fast approaching.

OTOH it isn't really hard to simulate the same thing with Win32 named pipes.


I’ve always found it annoying that we have these multiple types of socket addressing, all invented before there was any concept of a URL, and yet there’s no clear bijective mapping between arbitrary socket address formats and URL origins. I.e., you can’t just create a URL where the origin part is specified as a Unix domain socket path. (Or any other kind of format besides AF_INET and AF_INET6, despite there having been many other weird historical ones.)

Which means everyone who needs their client to support connecting to an RPC server over both AF_INET/AF_INET6 sockets and AF_UNIX sockets needs to invent some custom metaformat above URLs for specifying that they want a connection through a Unix socket path. Kind of ridiculous, really.


I'm not sure I understand what you mean. AF_INET and AF_INET6 don't represent URLs because they are on a completely different layer (layer 4), while URLs operate on layer 7. On top of that, there's an add-on protocol to translate names to IPs: DNS.

To do what you want would require redesigning the whole networking stack. Named Data Networking [1] (NDN), which unlike TCP/IP addresses data rather than hosts, would be able to support URLs like that natively.

[1] https://named-data.net/project/


I'm not sure there would be any difficulty supporting Unix socket paths in your URI scheme? I guess you can just steal the semantics from the file:// scheme.

Or do you want the same URI scheme to support all your different transport protocols? It kind of feels like missing the point of URIs, after all the whole idea is that your application can interchangeably handle different kinds of URIs for different use cases.


Yes. In my ideal world, URIs would be a sequence of "steps" to resolving a resource. Each step starts off with a nominal "resolver object" it's working against, and then names a scheme (method) available upon that resolver object; the rest of that step of the URL then supplies the method parameters to that abstract call, resulting in either another "resolver object" or a fully-resolved resource. The scheme of the final step determines the protocol for the interaction with the resource.

This is somewhat, but not perfectly, analogous to how 9P mounts work in Plan9. In both models, you start off with your local namespace, which has servers mounted on it like /mnt/web or /mnt/ftp. These are the "schemes" available on the initial resolver object. You then build a path to a virtual file within one of these mountpoints (/net/tcp, say), where that path turns out to point to a resource that describes (in a format another 9P server understands) how to create another mount representing that connection.

The difference between the two models is that, with 9P, you'd have to take this descriptor file and mount it yourself; whereas in my model, you wouldn't see a descriptor file, but rather another already-mounted server, with its own schemes as its root-level exposed virtual directories.

Pretending 9P worked this way, you'd be able to have a namespace path that looks like, e.g.:

    /net/tcp/foo.com/80/http/patha/pathb/a=b/c=d
Back in URL land, this kind of path would correspond to a "meta-URL" that looks like e.g.:

    tcp:foo.com:80#http:/patha/pathb?a=b&c=d
where the "#" symbol now takes on new significance as a command for the URL resolver, to address the next part of the path to the resolver-object discovered by resolving everything on the meta-URL so far. (Which perfectly subsumes #'s current use, since browsers interpret a # as identifying a particular DOM-tree node within a document—i.e. as using the resource acquired from the previous URL-step as a resolver to find the next object.)

You could do all sorts of crazy things with this, too, like creating URLs that represent files available within archives that exist as HTTP streams, e.g.:

    tcp:/foo.com/80#http:/foo/bar.tar#file:/index.html
Or adding DNS or TLS arbitrarily into protocol stacks that weren't designed for them:

    ip:/dns/foo.com#tcp#tls#irc:%23room


