
Scaling Linux Services: Before accepting connections - theojulienne
https://theojulienne.io/2020/07/03/scaling-linux-services-before-accepting-connections.html
======
brendangregg
Nice, although if you want to explore networking with ad hoc tracing tools,
please try bpftrace[0]. Only use BCC once you need argparse and other python
libraries.

Here's my bpftrace SYN backlog tool from BPF Performance Tools (2019 book,
tools are online[1]):

    
    
      # tcpsynbl.bt
      Attaching 4 probes...
      Tracing SYN backlog size. Ctrl-C to end.
      ^C
      @backlog[backlog limit]: histogram of backlog size
    
      @backlog[128]: 
      [0]                    2 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
    
      @backlog[500]: 
      [0]                 2783 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
      [1]                    9 |                                                    |
      [2, 4)                 4 |                                                    |
      [4, 8)                 1 |                                                    |
    

The source:

    
    
      #!/usr/local/bin/bpftrace
      
      #include <net/sock.h>
      
      BEGIN
      {
              printf("Tracing SYN backlog size. Ctrl-C to end.\n");
      }
      
      kprobe:tcp_v4_syn_recv_sock,
      kprobe:tcp_v6_syn_recv_sock
      {
              $sock = (struct sock *)arg0;
              @backlog[$sock->sk_max_ack_backlog & 0xffffffff] =
                  hist($sock->sk_ack_backlog);
              if ($sock->sk_ack_backlog > $sock->sk_max_ack_backlog) {
                      time("%H:%M:%S dropping a SYN.\n");
              }
      }
      
      END
      {
              printf("\n@backlog[backlog limit]: histogram of backlog size\n");
      }
    

This bpftrace tool is only 24 lines. The BCC tools in this post are >200 lines
(and more complex: you need to worry about bpf_probe_read() etc.). The
bpftrace version can also be easily modified to include extra details. I'm
summarizing backlog length as a histogram since our prod hosts can accept
thousands of connections per second.
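
For example, keying the histogram by the listening socket's local port as well
is a two-line change (a sketch, not from the book; skc_num holds the bound
port in host byte order):

    
    
      @backlog[$sock->__sk_common.skc_num,
          $sock->sk_max_ack_backlog & 0xffffffff] =
          hist($sock->sk_ack_backlog);
    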

[0] [https://github.com/iovisor/bpftrace](https://github.com/iovisor/bpftrace)
[1] [https://github.com/brendangregg/bpf-perf-tools-book](https://github.com/brendangregg/bpf-perf-tools-book)

~~~
theojulienne
Thanks for the suggestion! I did come across the `tcpsynbl.bt` script as I was
writing up this post, but I wanted to add the namespace handling and report
some extra details, which didn't seem as trivial in `bpftrace` as it was in
Python; that might just be my lack of familiarity with the DSL :)

~~~
brendangregg
If it's a common use case it's trivial, and if it's not yet trivial we'll make
it trivial. :) Niche functionality that doesn't fit well can be deferred to
BCC.
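
For the record, one possible approach for the namespace case — an untested
sketch, assuming CONFIG_NET_NS and the usual struct layouts — is to add the
netns inode number to the map key:

    
    
      #include <net/net_namespace.h>
      
      kprobe:tcp_v4_syn_recv_sock,
      kprobe:tcp_v6_syn_recv_sock
      {
              $sock = (struct sock *)arg0;
              // ns.inum identifies the network namespace
              @backlog[$sock->__sk_common.skc_net.net->ns.inum,
                  $sock->sk_max_ack_backlog & 0xffffffff] =
                  hist($sock->sk_ack_backlog);
      }
    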

------
IgorPartola
This is well written. I never gave much thought to the resource usage during
the period between SYN and accept. This article explained it very nicely.
Also, now I’m curious why these Linux limits don’t scale with the amount of
RAM available. Like, yes, on a low-resource machine you wouldn’t want more
than the default 128 for the backlog. But if I have 512GB of RAM then why not
give me a backlog of a few thousand?

~~~
JoshTriplett
In general, Linux does favor automatic defaults over fixed static settings, if
there's a reasonable heuristic to produce those defaults. But suppose, for
instance, that a backlog that large actually starts to fill. There are two
possibilities: either you're processing connections fast enough to keep up, or
you're not keeping up at all. In the former case, scaling the backlog up may
help you ride out bursts, though you may already have unacceptable latency. In
the latter case, no amount of backlog will help you, and a larger one may make
an attacker's job easier.

That said, there might well be a case for automatic backlog scaling. Or, for
that matter, for increasing the default.

------
ampersandy
Is there a particular reason the Linux kernel favors names like `somaxconn`
instead of `socket_max_connections`? It seems like a rather straightforward
improvement for readability; so, why are shorter, compressed names preferred?

~~~
codys
This particular name originates from BSD 4.2 [1], which was released in 1983.
(For some context: GCC 1.0 is from 1987, and pcc was used to build BSD 4.2.
The first Linux release was in 1991.)

1: [https://github.com/dspinellis/unix-history-repo/blob/0f4556f...](https://github.com/dspinellis/unix-history-repo/blob/0f4556f12c8f75078501c9d1338ae7648a97f975/usr/src/sys/h/socket.h#L91)

------
fabian2k
What are some very rough estimates on when it makes sense to look at these
low-level network settings when scaling an application? I assume the default
settings are good enough for moderate loads, but at which point does this
stuff become a bottleneck?

Are the default settings here reasonable for most cases, or is this something
you should tune even if you're not really pushing any limits?

~~~
VWWHFSfQ
My NGINX webserver configuration on AWS behind an ALB is:

/etc/sysctl.conf:

    
    
        net.core.wmem_max = 12582912
        net.core.rmem_max = 12582912
        net.ipv4.tcp_rmem = 10240 87380 12582912
        net.ipv4.tcp_wmem = 10240 87380 12582912
        fs.file-max = 1000000
        net.ipv4.ip_local_port_range = 1024 65535
        net.ipv4.tcp_tw_recycle = 1
        net.ipv4.tcp_tw_reuse = 1
        net.ipv4.tcp_max_syn_backlog = 262144
        net.ipv4.tcp_syncookies = 0
        net.ipv4.tcp_fin_timeout = 3
        net.ipv4.tcp_syn_retries = 2
        net.ipv4.tcp_synack_retries = 2
        net.ipv4.tcp_no_metrics_save = 1
        net.ipv4.tcp_max_orphans = 262144
        net.core.somaxconn = 1000000
    

nginx.conf (just the relevant directives):

    
    
        worker_rlimit_nofile 102400;
    
        events {
          worker_connections 102400;
          multi_accept on;
        }
    
        http {
            server {
              listen 80 default_server reuseport backlog=102400;
              ...
            }    
        }
    

As you can see, the socket and backlog-related values have been cranked way
up. I've never had any problems with this configuration. Because these servers
are behind an ALB, I don't know how relevant the settings are, since the
SYN/SYN-ACK round trip happens between the server and the load balancer rather
than the remote clients. But I could be wrong; maybe there's something I'm
missing. In any case, I've never had any performance problems related to TCP
connections in the kernel or NGINX.

~~~
shanemhansen
I think with an ALB you'll see pooled connections (HTTP keep-alive or HTTP/2),
so I would expect the number of TCP connections to stay pretty low. With
HTTP/2 it could theoretically be as low as one.

------
eximius
Why do these values default so low? Seems like you could have a few hundred or
1k instead of 128 with relatively little memory overhead.

~~~
cutemonster
I've wondered about that too. Maybe it's because in the 1990s there wasn't
that much memory.

------
lathiat
This was a great read: a solid introduction with practical examples, good
graphics, and BPF tracing. I loved it.

------
batter
I wish I had seen this when I was preparing one of our apps for huge traffic
spikes. I made almost all of the changes described, but it took some time to
find and understand each of them. The end result was pretty good.

------
marcosdumay
Yeah, just more reasons for microkernels and for keeping these things in
userspace, where applications can ship with values that suit their usage.

------
yxhuvud
I wonder how io_uring changes this equation. The ability to have multiple
concurrent accepts queued up could push the limits around a bit.
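
For reference, a rough sketch of what that looks like with liburing's
multishot accept (assuming liburing >= 2.2 and kernel >= 5.19; names and queue
sizes are illustrative): a single SQE keeps producing one CQE per accepted
connection.

    
    
      #include <liburing.h>
      #include <stdio.h>
      
      void accept_loop(int listen_fd) {
          struct io_uring ring;
          io_uring_queue_init(256, &ring, 0);
      
          /* One submission re-arms itself: each accepted connection
           * produces its own completion. */
          struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
          io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
          io_uring_submit(&ring);
      
          for (;;) {
              struct io_uring_cqe *cqe;
              io_uring_wait_cqe(&ring, &cqe);
              if (cqe->res >= 0)
                  printf("accepted fd %d\n", cqe->res);  /* hand off to a worker */
              int more = cqe->flags & IORING_CQE_F_MORE; /* multishot still armed? */
              io_uring_cqe_seen(&ring, cqe);
              if (!more)
                  break;  /* a real server would re-arm here */
          }
      
          io_uring_queue_exit(&ring);
      }
    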

------
jeffbee
A good advertisement for userspace networking.

~~~
cranekam
Why? Because this stuff is in the kernel and thus harder to see? I don’t
expect moving it to userspace to reduce overall complexity; it just moves it
elsewhere.

~~~
jeffbee
Because it’s in the kernel, you aren’t exposed to all the knobs. There are
hundreds of parameters controlling Linux TCP behavior, and even experts
overlook some of them. Hoisting this up into your application makes it
visible. Why should there be a system parameter that silently limits the
accept backlog of your server? It makes no sense.

~~~
stjohnswarts
Because the people who wrote it think it's better kept a bit more inaccessible
in the kernel, so that regular users don't shoot themselves in the foot
thinking they know better than the designers what the values should be. The
people who know what they're doing will be able to set the parameters
regardless of where they're hidden.

~~~
jeffbee
This condescending attitude is itself a strong argument against whatever it is
you work on.

