Container-to-Container Communication (miketheman.net)
80 points by miketheman on Dec 28, 2021 | 44 comments



Normally I'm all for people using tried and tested primitives for things, however I think that in this case unix sockets are probably not the right choice.

Firstly, you are creating a hard dependency on the two services sharing the same box, with a shared filesystem (which is difficult to coordinate and secure). And should you add a new service that also wants to connect via unix socket, things could get tricky to orchestrate.

But this also limits your ability to move stuff about, should you need it.

Inside a container, I think it's probably a perfectly legitimate way to do IPC. Between containers, I suspect you are asking for trouble.


Within a pod it seems pretty reasonable.


Exactly, and this is where the concept of "Pod" really shines. Sharing networking and filesystems between containers is, from time to time, exactly the right strategy, as long as it can be encapsulated.


Unix socket requires elevated privileges that containers shouldn’t have in the first place.


Unix sockets only require as much permission as creating a file in a directory. If a program has write access to a directory, it can create a unix socket. File permissions on the socket then dictate which users may connect to it.
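
To make that concrete, here's a minimal Python sketch (the socket path is just an example): any process that can write to the directory can create the socket, and a plain chmod controls who may connect.

    import os
    import socket

    SOCK_PATH = "/tmp/app.sock"  # hypothetical path; any writable directory works

    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCK_PATH)      # no elevated privileges, just write access to the directory
    os.chmod(SOCK_PATH, 0o660)  # owner and group may connect, everyone else is refused
    server.listen()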


Doesn't 700 requests per second for such a trivial service seem kinda slow?


Yeah, it's so slow that I'm wondering if they were actually measuring TCP/unix socket overhead. I wouldn't expect to see a difference at such a low frequency.


Yeah, seems like there was some other bottleneck. Maybe changing the IPC method makes the small difference we are seeing, but we should be seeing orders of magnitude higher TPS prior to caring about the IPC method.


I didn't look too closely, but I'm wondering if this is Python's GIL. So instead of nginx -> multiple ~independent Python processes, each async handler is fighting for the lock, even if running on a different core. So read it as 700 queries / core vs 700 queries / server. If so, in a slightly tweaked Python server setup, that'd be 3K-12K/s per server. For narrower benchmarking, keeping it sequential and pinning container 1 to core 1 <-> container 2 to core 2 might paint an even clearer picture.

I did enjoy the work overall, incl. the Graviton comparison. Likewise, OS X's painfully slow Docker impl has driven me to Windows w/ WSL2 -> Ubuntu + nvidia-docker, which has been night and day, so I'm not as surprised that those numbers are weird.


Glad you liked it!

In my example, I'm using an unmodified gunicorn runner to load the uvloop worker. So I'm still only using a single worker process. Once I start tweaking the `--workers` count, I get much higher queries per second.
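
For reference, a config along these lines is what I mean (the worker class and socket path here are illustrative, not necessarily exactly what the post uses):

    # gunicorn.conf.py
    import multiprocessing

    bind = "unix:/tmp/app.sock"  # or "0.0.0.0:8000" for the TCP variant

    # uvicorn's gunicorn worker runs the app on uvloop
    worker_class = "uvicorn.workers.UvicornWorker"

    # bumping past a single worker is what lifts the queries-per-second ceiling
    workers = multiprocessing.cpu_count() * 2 + 1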

And you're correct - this is a narrow benchmark, not designed to test total TPS or saturate resources.


Great, suspected something like that :)

Interesting wrt workers -- how does TCP vs sockets start looking in a multiple worker + multicore scenario, esp. wrt peak QPS? That's more confusing for schedulers :) Also, FWIW, the GIL thing might even be true in the case of single core <> single vs multiple workers. Docker supports pinning (`cpuset`?), so it should be pretty doable.

We actually have a fairly similar production setup, so, GPUs confusing things aside, I've been curious how deep to look in upcoming scaling work.

As a fun extra wrinkle, we also used to have nginx as a poor man's api gateway: `request -> [ nginx_container -[tcp]-> app_container1 -[tcp]-> nginx_container -[tcp]-> app_container2 -[tcp]-> ... ]`, but had enough quality issues that we removed the internal nginx indirections.


Wow, that's a lot of indirection! I've had the same setup, but only two requests deep. I think that if there's added value in each nginx tier (if each service has static assets, for example), then there may be some benefit, but tracing through all the layers gets harder. Glad to hear you removed the internal layers!


It may, but this is considering the overhead incurred in the local setup - calling the app through a Docker Desktop exposed port. Running locally on macOS produces ~5000 TPS. The `ab` parameters are also not using any sort of concurrency, and perform requests sequentially. The test is not designed to maximize TPS or saturate the resources, but rather to isolate variables for the comparison.


But this test becomes a lot more interesting when the bottleneck is the actual thing under scrutiny (socket performance), rather than Whatever Python Is Doing. You can probably get 10-20x the RPS out of a trivial golang server, which might shift the bottleneck closer to socket perf.


Possibly! I wasn't trying to get the highest RPS, rather to find a baseline and compare per architecture. I also wrote what I knew how to write in a short amount of time. ;)


Bit beside the point, but how many of you still run nginx inside container infrastructures? I've been keeping container hosts behind a firewall without explicit WAN access for a long time -- to expose public services, I offload the nginx tasks to CloudFlare by running a `cloudflared` tunnel. These "Argo" tunnels are free to use, and essentially give you a managed nginx for free. Nifty if you are using CloudFlare anyway.


Not nginx, but I run haproxy, which serves the reverse proxy role.

I use it instead of Google's own ingress because it gives you better control over load balancing and can react faster to deployments.


We run Traefik in our stack at work.



I think this is where `gRPC` shines. It can feel tedious, but really: define the interface, use the tooling to generate the stubs, implement, and you're done. It saves you from having to think up and implement a protocol yourself, and importantly from handling versioning when the features of the containerized apps start to grow or change.


The results are only relevant for AWS ECS Fargate due to the specifics of how they do volumes and CNI.


Truth! Wouldn't have it any other way. ;)


The multiple layers of abstraction in this make this test sorta moot. You have the AWS infra, the poor macOS implementation of Docker, the server architecture. Couldn't you have just taken a vanilla Ubuntu install, curled some dummy load n times, and gotten some statistics from that?


Possibly - however the premise is that I'm running an application on cloud infrastructure that I don't control - which is common today. I tried to call that out in the post.


https://podman.io/getting-started/network

> By definition, all containers in the same Podman pod share the same network namespace. Therefore, the containers will share the IP Address, MAC Addresses and port mappings. You can always communicate between containers in the same pod, using localhost.

I'm a noob here but why wouldn't you use IPC?

https://docs.podman.io/en/latest/markdown/podman-run.1.html#...


Network and IPC namespacing are just two different ways of communicating using namespaces. I can't imagine that the overhead of either is significant.

I can see, from a design perspective, that network namespacing is likely more scalable. Network addresses can be local or WAN while unix sockets would tie you to a single node. That implies a very different and slightly more rigid scaling strategy with IPC based communication.


For one, I'm not using `podman`. I'm using Amazon Elastic Container Service (Amazon ECS) on top of the Fargate compute layer, so I don't have to manage much beyond my application.


Ah gotcha. I understand some of those words.


I’m curious as to whether the HTTP requests re-used the TCP socket or if they were dumb “Connection: close” ones that closed the socket and set up a new one for each request.

The overhead for that alone would outstrip any benefits.
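
A quick way to see how much that matters, just with the stdlib (hypothetical localhost endpoint):

    import http.client
    import time

    HOST, PORT, N = "127.0.0.1", 8000, 200  # hypothetical local app

    # new connection per request -- roughly what "Connection: close" costs
    start = time.perf_counter()
    for _ in range(N):
        conn = http.client.HTTPConnection(HOST, PORT)
        conn.request("GET", "/")
        conn.getresponse().read()
        conn.close()
    print("fresh connections:", time.perf_counter() - start)

    # a single keep-alive connection reused for every request
    start = time.perf_counter()
    conn = http.client.HTTPConnection(HOST, PORT)
    for _ in range(N):
        conn.request("GET", "/")
        conn.getresponse().read()
    conn.close()
    print("reused connection:", time.perf_counter() - start)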


I don't think the `ab` tool does anything special, so it's likely these are closed and reopened after each request. For the purposes of comparison of one architecture to another, this was sufficient, and mimics user behavior well enough. But feel free to try out the test with different headers!


Everyone I know in the API / CDN / Proxy space switched from using ab to wrk years ago for good reason. PSA: Please stop using ab and use wrk/wrk2. ab doesn't even support HTTP 1.1...


Good to know! Would you be interested in performing the same tests using `wrk` and reporting the difference?


It would be more of an apples-to-apples comparison to run it on the same setup you ran ab on. You can grab wrk here: https://github.com/wg/wrk Enjoy!


It’s not just the tool, but also the web framework that has to support HTTP/1.1.


Isn't this what a socket library like zeromq is supposed to cover? Changing transports (tcp, ipc, inproc if in the same process, udp with radio/dish, ...) through config files when deploying?
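
Roughly, yes -- with pyzmq the code doesn't change, only the endpoint string does, so it can come from config at deploy time. A sketch (endpoint values here are made up):

    import os
    import zmq

    # endpoint comes from config/env at deploy time; only the string changes
    endpoint = os.environ.get("ZMQ_ENDPOINT", "ipc:///tmp/app.sock")  # or "tcp://10.0.0.2:5555"

    ctx = zmq.Context()
    sock = ctx.socket(zmq.REP)
    sock.bind(endpoint)

    msg = sock.recv()             # same request/reply code regardless of transport
    sock.send(b"ack: " + msg)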


I'm plugging this amazing resource (not only containers but also virtual machines...): https://developers.redhat.com/blog/2018/10/22/introduction-t...

It's lower level than OP but might give you ideas.


There were a few other combinations I wanted to see -- how does docker-to-docker compare with socket communication on the same local machine? I would love to know if there's a difference.

The results when running on other machines could be impacted by a number of different factors. It's almost impossible to know what is limiting performance without deep-diving into the logs.


I'm not sure I understand? These examples all use the Fargate compute layer and are deployed as a Task which colocates both containers on the same hardware, so there's no cross-machine factors to contend with.


No talk about permissions; I think locking down access is also an interesting aspect of Unix domain sockets compared to TCP sockets.


As if TCP is less secure? Not necessarily, as the TCP sockets here most likely live on a private network between the two containers: you run `podman network create example` and then `podman run --net=example whatever`. This creates an internal network (e.g., 10.0.0.1/10) over which the containers communicate. Only if you want to expose some port would you use the `-p` flag to expose it to the host LAN, which you then redirect to the WAN.


Not the person you responded to, but at the very least host processes can still access the internal network which may be undesired.


Indeed. You can use filesystem permissions to lock the socket down, and do client auth via the PID and/or UID of the client process instead of at the HTTP layer. For example you can assert that the client is from the Docker container that it's claiming to be from, by checking that its PID is in the list of PIDs in that container.

Of course this also requires some discipline from the user, like not running containers as root, and not running containers with Docker socket access (which is as good as having root).
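
On Linux you can pull the peer's credentials straight off the socket; a rough sketch (the container PID lookup is left as an assumption here):

    import socket
    import struct

    def peer_credentials(conn: socket.socket):
        # struct ucred on Linux is three ints: pid, uid, gid
        raw = conn.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, struct.calcsize("3i"))
        return struct.unpack("3i", raw)

    def client_is_from_container(conn: socket.socket, container_pids: set) -> bool:
        # container_pids would come from e.g. the container's cgroup or `docker top` (assumption)
        pid, uid, gid = peer_credentials(conn)
        return pid in container_pids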


You can do some weird stuff with Unix Domain sockets, though, like using sendmsg() to pass over an open file descriptor.
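
For example, with the Python 3.9+ helpers (SCM_RIGHTS under the hood):

    import socket

    parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

    with open("/etc/hostname", "rb") as f:
        # hand the open descriptor itself to the other end
        socket.send_fds(parent, [b"here is a file"], [f.fileno()])

    msg, fds, flags, addr = socket.recv_fds(child, 1024, maxfds=1)
    with open(fds[0], "rb") as received:  # the receiver gets its own copy of the fd
        print(msg, received.read())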


Seems irrelevant when running containers.



