Unix Domain Sockets vs Loopback TCP Sockets (2014) (nicisdigital.wordpress.com)
150 points by e12e 8 months ago | 75 comments




I agree. Always choose unix domain sockets over local TCP if it's an option. There are, though, some valid reasons to choose TCP.

In the past, I've chosen local TCP sockets because I can configure the receive buffer size to avoid burdening the sender (ideally both TCP and unix domain sockets should correctly handle EAGAIN, but I haven't always had control over the code that does the write). IIRC the max buffer size for unix domain sockets is lower than for TCP.

Another limitation of unix domain sockets is that the size of the path string must be less than PATH_MAX. I've run into this when the only directory I had write access to was already close to the limit. Local TCP sockets obviously do not have this limitation.

Local TCP sockets can also bypass the kernel if you have a user-space TCP stack. I don't know if you can do this with unix domain sockets (I've never tried).

I can also use local TCP for WebSockets. I have no idea if that's possible with unix domain sockets.

In general, I choose a shared memory queue for local-only inter-process communication.


> I can also use local TCP for WebSockets. I have no idea if that's possible with unix domain sockets.

The thing that makes this possible or impossible is how your library implements the protocol, at least in C/C++. The really bad protocol libraries I've seen, like those for MQTT, AMQP, et al., all insist on controlling both the connection stream and the protocol state machine, and commingle all of the code for both. They often also insist on owning your main loop, which is a bad practice for library authors.

A much better approach is to implement the protocol as a separate "chunk" of code with well-defined interfaces for receiving inputs and generating outputs on a stream, and with hooks for protocol configuration as needed. This allows me to do three things that are good:

* Choose how I want to do I/O with the remote end of the connection.

* Write my own main loop or integrate with any third-party main loop that I want.

* Test the protocol code without standing up an entire TLS connection.
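
To make that concrete, here's a rough sketch of what such a connection-agnostic ("sans-I/O") interface could look like in C. All names here are hypothetical, just to show the shape: the library never touches a socket, it only consumes bytes you hand it and emits bytes for you to send however you like.

    /* Hypothetical sans-I/O protocol interface: the library owns only the
     * protocol state machine, never the connection or the main loop. */
    #include <stddef.h>

    typedef struct proto proto_t;   /* opaque protocol state machine */

    proto_t *proto_new(void);
    void     proto_free(proto_t *p);

    /* Feed bytes received from any transport: TCP, UDS, SPI, CAN, ... */
    int proto_feed(proto_t *p, const unsigned char *buf, size_t len);

    /* Drain bytes the protocol wants transmitted; the caller does the I/O. */
    size_t proto_pending(proto_t *p, unsigned char *out, size_t cap);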

I've seen a LOT of libraries that don't allow these things. Apache's QPID Proton is a big offender for me, although they were refactoring in this direction. libmosquitto provides some facilities to access the file descriptor but otherwise tries to own the entire connection. So on and so forth.

Edit: I get how you end up there because it's the easiest way to figure out the libraries. Also, if I had spare time on my hands I would go through and work with maintainers to fix these libraries because having generic open-source protocol implementations would be really useful and would probably solve a lot of problems in the embedded space with ad-hoc messaging implementations.

If the protocol library allows you to control the connection and provides a connection-agnostic protocol implementation then you could replace a TLS connection over TCP local sockets from OpenSSL with SPI transfers or CAN transfers to another device if you really wanted to. Or Unix Domain Sockets, because you own the file descriptor and you manage the transfers yourself.


> Local TCP sockets can also bypass the kernel if you have a user-space TCP stack. I don't know if you can do this with unix domain sockets (I've never tried).

Kernel bypass exists because hardware can handle more packets than the kernel can read or write, and all the tricks employed are clever workarounds (read: kinda hacks) to get the packets managed in user space.

This is kind of an orthogonal problem to IPC, and there's already a well defined interface for multiple processes to communicate without buffering through the kernel - and that's shared memory. You could employ some of the tricks (like LD_PRELOAD to hijack socket/accept/bind/send/recv) and implement it in terms of shared memory, but at that point why not just use it directly?

If speed is your concern, shared memory is always the fastest IPC. The tradeoff is that you now have to manage the messaging across that channel.


In my experience, for small unbatchable messages, UNIX sockets are fast enough not to warrant the complexity of dealing with shared memory.

However, for bigger and/or batchable messages, shared memory ringbuffer + UNIX socket for synchronization is the most convenient yet fast IPC I've used.
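
For the curious, a minimal sketch of that pattern in C, assuming a single producer and a single consumer, a power-of-two ring, and related processes; unrelated processes would use shm_open() + mmap() and a named Unix socket instead of socketpair().

    #include <stdatomic.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <sys/socket.h>

    #define RING_SIZE 4096                       /* power of two */

    struct ring {
        _Atomic size_t head, tail;               /* producer / consumer indices */
        unsigned char buf[RING_SIZE];
    };

    /* Anonymous shared mapping: fork() after this and both processes see
     * the same ring. */
    static struct ring *ring_create(void) {
        void *p = mmap(NULL, sizeof(struct ring), PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        return p == MAP_FAILED ? NULL : p;
    }

    /* Copy a message in, then poke the consumer through a socketpair() fd
     * so it wakes up and drains everything between tail and head. */
    static int ring_put(struct ring *r, int notify_fd,
                        const unsigned char *msg, size_t len) {
        size_t head = atomic_load(&r->head);
        size_t tail = atomic_load(&r->tail);
        if (RING_SIZE - (head - tail) < len)
            return -1;                           /* full: back off or block */
        for (size_t i = 0; i < len; i++)
            r->buf[(head + i) & (RING_SIZE - 1)] = msg[i];
        atomic_store(&r->head, head + len);
        return send(notify_fd, "", 1, 0) == 1 ? 0 : -1;
    }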


On Linux you can use abstract names, prefixed with a null byte. They disappear automatically when your process dies, and afaik don’t require rw access to a directory.
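
A sketch of what that looks like in C, per unix(7). The one subtlety is that the address length passed to bind() must count only the bytes actually used, since NULs are significant in the abstract namespace.

    #include <stddef.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    /* Sketch: bind a Linux abstract-namespace socket. The name begins with
     * a NUL byte and never touches the filesystem. */
    int bind_abstract(const char *name) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd < 0)
            return -1;

        struct sockaddr_un addr;
        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        addr.sun_path[0] = '\0';                 /* abstract namespace marker */
        strncpy(addr.sun_path + 1, name, sizeof(addr.sun_path) - 2);

        /* Count only the bytes in use: NULs are significant here. */
        socklen_t len = offsetof(struct sockaddr_un, sun_path) + 1 + strlen(name);
        return bind(fd, (struct sockaddr *)&addr, len) == 0 ? fd : -1;
    }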


> Another limitation of unix domain sockets is that the size of the path string must be less than PATH_MAX. I've run into this when the only directory I had write access to was already close to the limit. Local TCP sockets obviously do not have this limitation.

This drove me nuts for a long time, trying to hunt down why the socket couldn't be created. It's a really subtle limitation, and there's not a good error message or anything.

In my use case, it was for testing the server creating the socket, and each test would create its own temp dir to house the socket file and various other resources.

> In general, I choose a shared memory queue for local-only inter-process communication.

Do you mean the sysv message queues, or some user space system? I've never actually seen sysv queues in the wild, so I'm curious to hear more.


Depends on the user-space stack, but OpenOnload doesn't. But, this topic of user-space acceleration of pipes created over Unix sockets comes up here periodically... some of my previous comments:

https://news.ycombinator.com/item?id=24968260 Talking about using kernel bypass on pipes accepted over a UNIX socket. Link to an old asio example implementation on GitHub

https://news.ycombinator.com/item?id=31922762 Kernel bypass to FPGA journey, followed up with some user-space pipe talk with others

I do tend to use accelerated TCP loopback instead of the UNIX pipes, was just easier operationally across a cluster to use TCP.


Isn't PATH_MAX 4k characters these days? Have to have some pretty intense directory structures to hit that.


For unix domain sockets on Linux the max is 108 bytes, including a null terminator.

https://www.man7.org/linux/man-pages/man7/unix.7.html

https://unix.stackexchange.com/questions/367008/why-is-socke...
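
The limit is easy to check for yourself, since it's just the size of the sun_path field in struct sockaddr_un:

    #include <stdio.h>
    #include <sys/un.h>

    int main(void) {
        struct sockaddr_un addr;
        /* Prints 108 on Linux: the path plus its NUL must fit in here. */
        printf("sizeof(sun_path) = %zu\n", sizeof(addr.sun_path));
        return 0;
    }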


The biggest reason for me is that you can use filesystem permissions to control access. Often I want to run a service locally and do auth at the reverse proxy, but if the service binds to localhost then all local processes can access without auth. If I only grant the reverse proxy permissions on the filesystem socket then you can't access without going through the auth.
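
A sketch of that setup in C; the socket path and the "revproxy" group are made-up examples. Connecting to a Unix socket requires write permission on the socket file, which is what makes this work.

    #include <grp.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <sys/un.h>
    #include <unistd.h>

    int main(void) {
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        struct sockaddr_un addr = { .sun_family = AF_UNIX };
        strncpy(addr.sun_path, "/run/myapp/api.sock", sizeof(addr.sun_path) - 1);

        unlink(addr.sun_path);                   /* clear any stale socket */
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return 1;

        /* Only the owner and the revproxy group can connect now. */
        struct group *g = getgrnam("revproxy");  /* hypothetical group */
        if (g)
            chown(addr.sun_path, (uid_t)-1, g->gr_gid);
        chmod(addr.sun_path, 0660);

        return listen(fd, 16) == 0 ? 0 : 1;
    }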


And with `SO_PEERCRED`, you can even implement more complex transparent authorization & logging based on the uid of the connecting process.
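
A minimal sketch of reading the peer credentials on Linux after accept():

    #define _GNU_SOURCE             /* for struct ucred */
    #include <stdio.h>
    #include <sys/socket.h>

    /* Ask the kernel for the connecting process's credentials;
     * no protocol-level handshake needed. */
    void log_peer(int connfd) {
        struct ucred cred;
        socklen_t len = sizeof(cred);
        if (getsockopt(connfd, SOL_SOCKET, SO_PEERCRED, &cred, &len) == 0)
            printf("peer pid=%d uid=%d gid=%d\n",
                   (int)cred.pid, (int)cred.uid, (int)cred.gid);
    }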


This is true but to me mostly negates the benefit for this use case. The goal is to offload the auth work to the reverse proxy not to add more rules.

Although I guess you could have the reverse proxy listen both on IP and UNIX sockets. It can then do different auth depending on how the connection came in. So you could auth with TLS Cert or Password over IP or using your PID/UNIX account over the UNIX socket.


These matter if you need to bind to multiple ports, but if you're only running a handful of services that need to bind a socket, then port number allocation isn't a big issue. TCP buffer autotuning problems also matter at a certain scale, but in my experience they require a tipping point. TCP sockets also have configurable buffer sizes while Unix sockets have a fixed buffer size, so TCP socket buffers can get much deeper.

At my last role we benchmarked TCP sockets vs Unix sockets in a variety of scenarios. In our benchmarks, only certain cases benefited from Unix sockets and generally the complexity of using them in containerized environments made them less attractive than TCP unless we needed to talk to a high throughput cache or we were doing things like farming requests out to a FastCGI process manager. Generally speaking, using less chatty protocols than REST (involving a lot less serde overhead and making it easier to allocate ingest structures) made a much bigger difference.

I was actually a huge believer in deferring to Unix sockets where possible, due to blog posts like these and my understanding of the implementation details (I've implemented toy IPC in a toy kernel before), but a coworker challenged me to benchmark my belief. Sure enough on benchmark it turned out that in most cases TCP sockets were fine and simplified a containerized architecture enough that Unix sockets just weren't worth it.


> the complexity of using [UNIX sockets] in containerized environments made them less attractive than TCP

Huh, I would think UNIX sockets would be easier; since sharing the socket between the host and a container (or between containers) is as simple as mounting a volume in the container and setting permissions on the socket appropriately.

Using TCP means dealing with iptables and seems... less fun. I easily run into cases where the host's iptables firewall interferes with what Docker wants to do with iptables such that it takes hours just to get simple things working properly.


It's an issue of tooling I think, though it depends on the container runtime.

E.g. in docker you can use -p to publish ports of containers on the host; this tends to get messier for less ad-hoc usage where you want to publish them between containers, but docker-compose and similar handle all that for you.

The benefit is that this works whether the container is running in a VM or in a namespace created by you or root, and it can even work if the container runs somewhere else.

With pipes you have to volume-mount them, and do so in a way that works with whatever docker uses for mounting, which can get a bit annoying if you then also mix in docker on Windows or Mac.

Though if we're talking about containerization for apps, e.g. using snap/flatpak, pipes should work just fine.

And in the end they are the most commonly used mechanism for cross-process communication on the same system, i.e. the use case where you don't have to worry about VMs and cross-OS communication.


This.

Especially since docker does a lot of magic, dynamically adding/removing iptables rules, which is already a nightmare to manage, so you really want to avoid dealing with more.


Also, UDS have more features; for example, you can get the remote peer UID and pass FDs.


And SOCK_SEQPACKET, which greatly simplifies fd-passing.


How does SOCK_SEQPACKET simplify fd-passing? Writing a streaming IPC crate as we speak and wondering if there are land mines beyond https://gist.github.com/kentonv/bc7592af98c68ba2738f44369208...


Well, the kernel does create an implicit packetization boundary when you attach FDs to a byte stream... but this is underdocumented, and there's an impedance mismatch between byte streams and discrete application-level messages. With SEQPACKET you can also send zero-sized messages to pass an FD; with byte streams you must send at least one byte. That means you can send the FDs separately after sending the bytes, which makes it easier to notify the application that it should expect FDs (in case it's not always using recvmsg with a cmsg allocation prepared). SEQPACKET just makes it more straightforward, because one message (+ ancillary data) is always one sendmsg/recvmsg pair.
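
For reference, here's roughly what the send side looks like on a byte stream, with the one obligatory payload byte (a sketch, error handling mostly elided; it follows the cmsg(3) pattern):

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    /* Sketch: send one fd plus one payload byte over a Unix stream socket. */
    int send_fd(int sock, int fd_to_pass) {
        char byte = 'F';
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

        union {                      /* properly aligned cmsg buffer */
            char buf[CMSG_SPACE(sizeof(int))];
            struct cmsghdr align;
        } u;
        struct msghdr msg = {
            .msg_iov = &iov, .msg_iovlen = 1,
            .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
        };

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;            /* payload is a file descriptor */
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }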


I appreciate your reply!

My approach has been to send a header with the number of fds and bytes the next packet will contain, and the number of payload bytes is naturally never 0 in my case.


+1


If only there was a button for that.


It's a bit obscure, but 127.x.x.x is a /8, so you have quite a few loopback IP/port combos. I've tested it and it works on Windows, Linux, and GHS INTEGRITY.
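
A sketch in C; any address in the block can be bound like a normal local IP:

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>

    /* Each (loopback IP, port) pair is a distinct listener, so
     * bind_loopback("127.0.0.1", 8080) and bind_loopback("127.1.2.3", 8080)
     * can coexist. */
    int bind_loopback(const char *ip, unsigned short port) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr;
        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        inet_pton(AF_INET, ip, &addr.sin_addr);
        return bind(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0 ? fd : -1;
    }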


On some systems it's a /8, but you can still only bind to 127.0.0.x; we ran into that during testing before.


We've seen observable performance increases in migrating to unix domain sockets wherever possible, as some TCP stack overhead is bypassed.


Adjacently, remember that with TCP sockets you can vary the address anywhere within 127.0.0.0/8


However, this is not the case for IPv6: technically you can only use ::1, unless you use IPV6_FREEBIND.


You usually have a whole bunch of link-local IPv6 addresses. Can't you use them?


One problem I've run into when trying to use Unix sockets, though, is that they can only buffer a fairly small number of messages at once, so with a lot of messages in flight you can easily end up with sends failing. TCP sockets can handle a lot more messages.


Can't you tune this with sysctl?


You can set net.core.wmem_default, though that's a system-wide setting you have to override. And then you can end up with large message queues instead, if you have a lot of small messages (which is a concern in the embedded systems I work on). The problem is really the large per-message overhead, close to a kilobyte per message; TCP sockets have just a fraction of that.
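
(A sketch: the buffer can also be raised per socket with SO_SNDBUF, capped by net.core.wmem_max, instead of changing the global default. Whether that helps with the per-message overhead is another matter.)

    #include <stdio.h>
    #include <sys/socket.h>

    void grow_sndbuf(int fd) {
        int size = 1 << 20;         /* request 1 MiB */
        setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size));

        socklen_t len = sizeof(size);
        getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &size, &len);
        printf("effective send buffer: %d bytes\n", size); /* kernel doubles the request */
    }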


If different components of your system are talking over a pretend network you've already architected yourself face first into a pile of shit. There's no argument for quality either way, so I'll just use TCP sockets and save myself 2 hours when I inevitably have to get it running on Windows.


FYI, Windows supports Unix domain sockets since Windows 10 / Server 2019.


I had not heard of this! Long story short, AF_UNIX now exists for Windows development.

https://devblogs.microsoft.com/commandline/af_unix-comes-to-... https://visualrecode.com/blog/unix-sockets/#:~:text=Unix%20d....


Good thing to mention, thanks.

That's mostly why I said 2 hours and not a day, as you still have to deal with paths (there's no /run) and you may have to fiddle with UAC or, god save us, NTFS permissions.


>If different components of your system are talking over a pretend network you've already architected yourself face first into a pile of shit.

How do you have your file delivery, database, and business logic "talk" to each other? Everything on the same computer is a "pretend network" to some extent, right? Do you always architect your own database right into your business logic along with a web-server as a single monolith? One off SPAs must take 2-3 months!


Yes, replacing a full-duty database with an in-process SQLite generally simplifies things if you can afford it. Even if not, that's a bad example, since in prod your fat database will be on another computer for real, so you'd never use a Unix socket when developing locally.


AF_VSOCK is another one to consider these days. It's a kind of hybrid of loopback and Unix. Although they are designed for communicating between virtual machines, vsock sockets work just as well between regular processes. Also supported on Windows.

https://www.man7.org/linux/man-pages/man7/vsock.7.html https://wiki.qemu.org/Features/VirtioVsock
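
A sketch of a vsock listener in C, per vsock(7). VMADDR_CID_ANY accepts on any local CID; on newer kernels VMADDR_CID_LOCAL can be used for purely host-local communication.

    #include <linux/vm_sockets.h>   /* struct sockaddr_vm, VMADDR_* */
    #include <string.h>
    #include <sys/socket.h>

    int vsock_listen(unsigned int port) {
        int fd = socket(AF_VSOCK, SOCK_STREAM, 0);
        struct sockaddr_vm addr;
        memset(&addr, 0, sizeof(addr));
        addr.svm_family = AF_VSOCK;
        addr.svm_cid = VMADDR_CID_ANY;           /* any local CID */
        addr.svm_port = port;
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return -1;
        return listen(fd, 16) == 0 ? fd : -1;
    }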


With some luck and love in the future hopefully we'll also be able to use them in containers https://patchwork.kernel.org/project/kvm/cover/2020011617242... which would simplify a lot of little things.


VMMs such as firecracker and cloud-hypervisor translate between vsock and UDS. [1]

In recent kernel versions, sockmap also has vsock translation: <https://github.com/torvalds/linux/commit/5a8c8b72f65f6b80b52...>

This allows for a sort of UDS "transparency" between guest and host. When the host is connecting to a guest, the use of a multiplexer UDS is required. [1]

[1] <https://github.com/firecracker-microvm/firecracker/blob/main...>


What's the advantage to vsocks over Unix domain sockets? UDS's are very fast, and much easier to use.


I didn't mean to imply any advantage, just that they are another socket-based method for two processes to communicate. Since vsocks use a distinct implementation they should probably be benchmarked alongside Unix domain sockets and loopback sockets in any comparisons. My expectation is they would be somewhere in the middle - not as well optimized as Unix domain sockets, but with less general overhead than TCP loopback.

If you are using vsocks between two VMs as intended then they have the advantage that they allow communication without involving the network stack. This is used by VMs to implement guest agent communications (screen resizing, copy and paste and so on) where the comms don't require the network to have been set up at all or be routable to the host.


I did not know about this. Thanks for the tip!


I'd be more interested in the security and usability aspect. Loopback sockets (assuming you don't accidentally bind to 0.0.0.0, which would make it even worse) are effectively rwx to any process on the same machine that has the permission to open network connections, unless you bother with setting up a local firewall (which requires admin privileges). On top of that you need to figure out which port is free to bind to, and have a backup plan in case the port isn't free.

Domain sockets are simpler in both aspects: you can create one in any suitable directory, give it an arbitrary name, chmod it to control access, etc.


A lot of modern software disregards the existence of unix sockets, probably because TCP sockets are an OS agnostic concept and perform well enough. You'd need to write Windows-specific code to handle named pipes if you didn't want to use TCP sockets.


Windows actually added Unix sockets about six years ago, and with how aggressively Microsoft EOLs older versions of their OS (relative to something like enterprise Linux at least), it's probably a pretty safe bet to use at this point.

https://devblogs.microsoft.com/commandline/af_unix-comes-to-...


With how aggressively Microsoft EOLs older versions of their OS, we're still finding decades-old server and client systems at clients.

While Server 2003 is getting rarer and the last sighting of Windows 98/2000 has been a while, they're all still running at the very least a few months after the last free security support is gone. But whether that's something you want to support as a developer is your choice to make.


That's not very relevant.

If you start developing new software today, it won't need to run on those computers. And if it's old enough that it needs to, you can bet all of those architectural decisions were already made and written in stone all over the place.


> If you start developing new software today, it won't need to run on those computers.

This is a weird argument to make.

For context, I work on mesh overlay VPNs at Defined.net. We initially used Unix domain sockets for our daemon-client control model. This supported Windows 10 / Server 2019+.

We very quickly found our users needed support for Server 2016. Some are even still running 2012.

Ultimately, as a software vendor, we can't just force customers to upgrade their datacenters.


It's actually the opposite of Microsoft quickly EOLing on the server side. Server 2012 was EVERYWHERE as late as 2018-2019. They were still issuing service packs in 2018.


Interesting, thanks.


Going forward, hopefully modern software will use the modern approach of AF_UNIX sockets in Windows 10 and above: https://devblogs.microsoft.com/commandline/af_unix-comes-to-...

EDIT: And it would be interesting for someone to reproduce a benchmark like this on Windows to compare TCP loopback and the new(ish) unix socket support.


Windows is exactly the reason they didn't prevail, imo. Windows named pipes have weird security caveats and are not really supported in high-level languages. I think this led everyone to just use loopback TCP as the portable IPC API instead of going with unix sockets.


IME a lot of developers have never even heard of address families and treat "socket" as synonymous with TCP (or possibly, but rarely, UDP).


A couple of years after this article came out, Windows added support for SOCK_STREAM Unix sockets.


Yes, but there are named pipes, and they can be used the same way on Windows. And Windows also supports UDS today. It's no excuse.


I imagine there should be some OS-agnostic libraries somewhere that handle it and provide the developer a unified interface.


One of the best things I did for my old home server was to switch to Unix sockets for the majority of my self-hosted services' databases; the performance difference on that very low-end hardware was monumental. Now that I'm on NVMe with a modern CPU the difference isn't as marked, but why leave any performance on the table?


> Two communicating processes on a single machine have a few options

Curiously, the article does not even mention pipes, which I would assume to be the most obvious solution for this task (but not necessarily the best, of course!)

In particular, I am wondering how Unix domain sockets compare to (a pair of) pipes. At first glance, they appear to be very similar. What are the trade-offs?


The pipe vs. socket perf debate is a very old one. Sockets are more flexible and tunable, which may net you better performance (for instance, by tweaking buffer sizes), but my guess is that the high-order bit of how a UDS and a pipe perform is the same.

Using pipes instead of a UDS:

* Requires managing an extra set of file descriptors to get bidirectionality

* Requires processes to be related

* Surrenders socket features like file descriptor passing

* Is more fiddly than the socket code, which can often be interchangeable with TCP sockets (see, for instance, the Go standard library)

If you're sticking with Linux, I can't personally see a reason ever to prefer pipes. A UDS is probably the best default answer for generic IPC on Linux.


With pipes, the sender has to add a SIGPIPE handler, which is not trivial to do if it's a library doing the send/recv. With sockets it can use send(fd, buf, len, MSG_NOSIGNAL) instead.
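
I.e. something like this, where a broken pipe surfaces as EPIPE in errno rather than a process-wide signal:

    #include <errno.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    /* With MSG_NOSIGNAL a dead peer yields EPIPE instead of SIGPIPE,
     * so a library can handle it locally. */
    ssize_t safe_send(int fd, const void *buf, size_t len) {
        ssize_t n = send(fd, buf, len, MSG_NOSIGNAL);
        if (n < 0 && errno == EPIPE) {
            /* peer closed; report an error upward instead of dying */
        }
        return n;
    }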


What's in the way of TCP hitting the same performance as unix sockets, is it just netfilter?


TCP has a lot of rules nailed down in numerous RFCs - everything from how to handle sequence numbers, the 3-way handshake, congestion control, and much more.

That translates into a whole lot of code that needs to run, while unix sockets are not that much more than a kernel buffer and code to copy data back and forth in that buffer - which doesn't need a lot of code to make happen.


I believe the conventional wisdom here is that UDS performs better because of fewer context switches and copies between userspace and kernelspace.


No. This is exactly the same. Think about the life of a datagram or of stream bytes at the syscall edge for each.


I'm not sure I understand. This isn't something I've thought about in a while, but it's pretty intuitive to me that a loopback TCP connection would pretty much always be slower: each transmission unit goes through the entire TCP stack, feeds into the TCP state machine, etc. That's more time spent in the kernel.


Yeah, but those aren't context switches.


The IP stack.


Would be better to retest.

If I remember correctly, we had the same results described in the article back in 2014, but I also remember that Linux loopback was optimized afterwards and the difference became much smaller, if visible at all.


Would TCP_NODELAY make any difference (good or bad)?
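
At least it's cheap to test; a sketch of flipping it on:

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    /* Disable Nagle's algorithm: small writes go out immediately instead
     * of being batched while waiting for ACKs. */
    void set_nodelay(int fd) {
        int one = 1;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }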


Why not UDP? Less overhead, and you can use multicast to expand messaging to machines on a LAN. TCP on localhost makes little sense, especially when simple acks can be implemented in UDP.

But even then, I wonder how the segmentation in TCP is affecting performance in addition to windowing.

Another thing I always wanted to try was using raw IP packets, why not? Just sequence requests and let the sender close a send transaction only when it gets an ack packet with the sequence # for each send. Even better, a raw AF_PACKET socket on the loopback interface! That might beat UDS!


Give it a try and find out! I'd give that blog post a read.

I suspect you'd run into all sorts of interesting issues... particularly if the server is one process but there are N>1 clients and you're using AF_PACKET.


Why not, I will try to not let the downvotes discourage me lol



