Linux TCP SO_REUSEPORT: Usage and Implementation (linuxjournal.rubdos.be)
33 points by todsacerdoti 43 days ago | 9 comments



Confusingly, on FreeBSD this option is SO_REUSEPORT_LB; SO_REUSEPORT is something different. So I find myself writing

    #if defined SO_REUSEPORT_LB
      setsockopt(fd, SOL_SOCKET, SO_REUSEPORT_LB, &(int) { 1 }, sizeof(int));
    #elif defined SO_REUSEPORT
      setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &(int) { 1 }, sizeof(int));
    #else
      /* do something to avoid reusing sockets */
    #endif
 
when I want to use this feature.


SO_REUSEPORT and SO_REUSEPORT_LB are not quite the same:

      SO_REUSEPORT    enables duplicate address and port bindings
      SO_REUSEPORT_LB enables duplicate address and port bindings
                      with load balancing
* https://man.freebsd.org/cgi/man.cgi?setsockopt(2)

The LB variant was only added in 2017:

* https://reviews.freebsd.org/D11003

SO_REUSEPORT has been around since 4.4BSD:

* https://man.freebsd.org/cgi/man.cgi?query=setsockopt&sektion...

SO_REUSEPORT was added to Linux in ~2013:

* https://lwn.net/Articles/542629/


AFAIU, the semantics of SO_REUSEPORT for TCP were consistent across the BSDs, at least since the time the Google folks started exploring this area. For TCP the behavior was effectively a LIFO/stack-like queue for listening sockets; incoming connections were enqueued with the most recent socket to bind. If that socket was closed, then the next most recent socket, if any, would be chosen. I don't know if this was originally deliberate, but this effectively permitted seamless, robust server restarts (automated or manual) without accidentally losing connections, without requiring an intermediate process (e.g. inetd or systemd), and without any complicated IPC, so long as the older process drained its listening queue before exiting.
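To make the "drain its listening queue before exiting" part concrete, here is roughly what the old process would do once the new one has bound. It is a sketch under assumptions, not anyone's actual implementation: `drain_listen_queue`, `conn_handler`, `close_conn` and the grace period are all hypothetical, and whether new connections really stop arriving at the old socket depends on the BSD behavior described above.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical callback type: takes ownership of an accepted connection. */
typedef void (*conn_handler)(int conn_fd);

/* Trivial example handler that simply closes the connection. */
static void close_conn(int conn_fd)
{
    close(conn_fd);
}

/*
 * Accept everything queued on lfd and stop once no new connection has shown
 * up for grace_ms milliseconds. Returns the number of connections handed to
 * `handle`, or -1 on error. The caller closes lfd afterwards.
 */
static int drain_listen_queue(int lfd, int grace_ms, conn_handler handle)
{
    int drained = 0;
    int flags;

    assert(lfd >= 0 && handle != NULL);

    /* Non-blocking, so accept() reports an empty queue instead of hanging. */
    flags = fcntl(lfd, F_GETFL, 0);
    if (flags < 0 || fcntl(lfd, F_SETFL, flags | O_NONBLOCK) < 0)
        return -1;

    for (;;) {
        int conn = accept(lfd, NULL, NULL);
        if (conn >= 0) {
            handle(conn);
            drained++;
            continue;
        }
        if (errno == EINTR || errno == ECONNABORTED)
            continue;                 /* transient, try again */
        if (errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                /* real error */

        /* Queue is empty: give in-flight handshakes a moment to land. */
        struct pollfd p = { .fd = lfd, .events = POLLIN };
        int r = poll(&p, 1, grace_ms);
        if (r < 0 && errno != EINTR)
            return -1;
        if (r == 0)
            break;                    /* quiet for grace_ms, done */
    }
    return drained;
}
```

The old process would call something like `drain_listen_queue(lfd, 100, serve)` and then `close(lfd)`, finishing its remaining connections before it exits.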

Unfortunately, this behavior was never documented in the manual pages; only the load balancing-like UDP semantics were documented. Previously, Linux didn't support SO_REUSEPORT at all, and unfortunately whoever decided to make use of SO_REUSEPORT on Linux either didn't check or didn't care what the actual behavior was for TCP connections on the BSDs. AFAIU, on the Linux side the original problem being solved was the infamous thundering herd issue when multiple processes or threads were polling on listening TCP sockets, as can occur with non-blocking/asynchronous I/O frameworks that utilize multiple processes or threads, each with its own event loop polling (i.e. poll/epoll_wait/kqueue) for incoming connections. The semantics they chose were largely compatible with how the BSDs supported this for UDP, but not how it worked for TCP.

Arguably, at least as regards TCP, SO_REUSEPORT wasn't added to Linux so much as an entirely different feature was created and the implementation decided to squat on the pre-existing macro definition out of convenience. (SO_REUSEPORT had always been defined, but, IIRC, simply ignored by Linux, similar to SO_RCVLOWAT.) That's not charitable, but neither was the decision to either not investigate the actual BSD behavior, or if they did to ignore or discount it.


Indeed. SO_REUSEPORT_LB is the FreeBSD equivalent of SO_REUSEPORT on Linux. SO_REUSEPORT on FreeBSD means something different, as you say. If SO_REUSEPORT_LB exists, we're on FreeBSD and want to use it! :)


DragonflyBSD also implemented something in 2013:

* https://lists.dragonflybsd.org/pipermail/users/2013-July/053...


Correct, Johannes Lundberg implemented this on my team at LLNW and we needed to preserve the existing interface and behavior at both an API and ABI level.


> Confusingly, on FreeBSD this option is SO_REUSEPORT_LB; SO_REUSEPORT is something different.

I think the confusing thing was that Linux added an option and named it "SO_REUSEPORT" even though that name had meant something completely different for decades.


Don't disagree, for what it's worth. The non-_LB behaviour for streams is useful too and isn't available at all on Linux. Though it's nowhere near as annoying as Linux having the epoll() mess instead of kqueue().


This is for medium sized servers. Interesting.



