
Linux TCP/IP Tuning - bluesmoon
http://www.lognormal.com/blog/2012/09/27/linux-tcpip-tuning/
======
vanni
_"nginx only uses HTTP/1.0 when it proxies requests to a back end server"_

No more:

[http://www.quora.com/Why-doesnt-Nginx-talk-HTTP-1-1-to-upstr...](http://www.quora.com/Why-doesnt-Nginx-talk-HTTP-1-1-to-upstream-servers)

[http://nginx.org/en/docs/http/ngx_http_upstream_module.html#...](http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive)

[http://mailman.nginx.org/pipermail/nginx/2011-August/028324....](http://mailman.nginx.org/pipermail/nginx/2011-August/028324.html)

~~~
bluesmoon
yep, unfortunately it will be a while before it gets into apt

~~~
laymil
If you're running nginx in production on Ubuntu or Debian, you'd probably be
well-served by using their packages rather than the distribution default ones.

<http://nginx.org/en/download.html>
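
For Ubuntu/Debian that amounts to something like this (a hedged sketch; the
release name is assumed, so check the download page for the exact repository
line for your distribution):

    # fetch and trust the nginx.org signing key
    wget http://nginx.org/keys/nginx_signing.key
    sudo apt-key add nginx_signing.key
    # add the nginx.org repository (release name "precise" assumed)
    echo "deb http://nginx.org/packages/ubuntu/ precise nginx" | \
        sudo tee /etc/apt/sources.list.d/nginx.list
    sudo apt-get update && sudo apt-get install nginx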

~~~
bluesmoon
thanks, will look into that

------
caf
_In order for a collision to take place, we’d have to get a new connection
from an existing client, AND that client would have to use the same port
number that it used for the earlier connection, AND our server would have to
assign the same port number to this connection as it did before._

Ephemeral ports aren't assigned to inbound connections, they're used for
outbound connections. So, for the client-to-nginx connection, both the server
IP and port are fixed (the port will be either 80 or 443) - only the client IP
and port change, so for a collision all you need is for a client to re-use the
same port on its side quickly.

For the nginx to node connection, both IPs and the server port are fixed,
leaving only the ephemeral port used by nginx to vary. You don't have to worry
about out-of-order packets here though, since the connection is loopback.

Note that only the side of the connection that initiates the close goes into
TIME_WAIT - the other side goes into a much shorter LAST_ACK state.
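
If you want to watch this on a live box, a quick sketch using ss from
iproute2 (the state filters are built in):

    # count sockets currently in TIME_WAIT vs LAST_ACK
    # (each count includes one header line)
    ss -tan state time-wait | wc -l
    ss -tan state last-ack | wc -l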

~~~
nviennot
For the nginx to node connection, it's a unix socket. No issue.
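
For anyone curious, a minimal sketch of what that looks like (the socket
path is hypothetical, and the node process must bind the same path):

    upstream node_backend {
        # node must listen on this same socket path
        server unix:/var/run/node.sock;
    }
    server {
        listen 80;
        location / {
            proxy_pass http://node_backend;
        }
    }

Since no TCP is involved, there are no ephemeral ports to exhaust and no
TIME_WAIT sockets to accumulate.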

~~~
olivier1664
So just to be sure, some noob questions: the 'ephemeral ports' and the
'TIME_WAIT state' tricks are there to handle the connections from nginx to
Node.js (not from the client to nginx)?

Sockets from client to nginx are uniquely identified by the client IP and the
client port. On each client request, nginx creates a new socket to Node.js?

Can there be more than one Node.js instance running? Is that the main goal of
nginx here, or are there additional benefits?

Edit, ok: "nginx is used for almost everything: gzip encoding, static file
serving, HTTP caching, SSL handling, load balancing and spoon feeding clients"
[http://blog.argteam.com/coding/hardening-node-js-for-product...](http://blog.argteam.com/coding/hardening-node-js-for-production-part-2-using-nginx-to-avoid-node-js-load/)

------
meritt
Really surprised this article doesn't mention _tcp_tw_reuse_ or
_tcp_tw_recycle_. These have a more substantial impact than simply adjusting
TIME_WAIT, since those ports will otherwise sit in a FIN_WAIT state for a long
time before they can be reused.

Excellent article on the subject:
<http://www.speedguide.net/articles/linux-tweaking-121>
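
For reference, a hedged sketch of enabling these (tcp_tw_reuse is generally
considered the safer of the two; tcp_tw_recycle is known to break clients
behind NAT):

    # allow reuse of sockets in TIME_WAIT for new outbound connections
    net.ipv4.tcp_tw_reuse = 1
    # aggressive TIME_WAIT recycling; risky for clients behind NAT
    net.ipv4.tcp_tw_recycle = 1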

~~~
semenko
Agreed. There are a lot of other possible optimizations, from the often-
mentioned buffer size settings:

      net.core.rmem_max / net.core.wmem_max
      net.ipv4.tcp_rmem / net.ipv4.tcp_wmem

to metric tunings like:

      net.ipv4.tcp_no_metrics_save / net.ipv4.tcp_moderate_rcvbuf

I've been playing around with these settings on very loaded machines:

      # Retry SYN/ACK only three times, instead of five
      net.ipv4.tcp_synack_retries = 3
      # Try to close things only twice
      net.ipv4.tcp_orphan_retries = 2
      # FIN-WAIT-2 for only 5 seconds
      net.ipv4.tcp_fin_timeout = 5
      # Increase syn socket queue size (default: 512)
      net.ipv4.tcp_max_syn_backlog = 2048
      # One hour keepalive with fewer probes (default: 7200 & 9)
      net.ipv4.tcp_keepalive_time = 3600
      net.ipv4.tcp_keepalive_probes = 5
      # Max packets the input can queue
      net.core.netdev_max_backlog = 2500
      # Keep fragments for 15 sec (default: 30)
      net.ipv4.ipfrag_time = 15
      # Use H-TCP congestion control
      net.ipv4.tcp_congestion_control = htcp
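
And for completeness, a hedged illustration of the buffer settings mentioned
above (values are commonly cited examples, not recommendations; the three
tcp_rmem/tcp_wmem fields are min, default, and max):

    # maximum socket receive/send buffer sizes (bytes)
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    # min / default / max TCP receive and send buffers (bytes)
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216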

~~~
stock_toaster
have you noticed much of a change with htcp as the congestion control algo?

------
ianshward
_"A large part of this is due to the fact that nginx only uses HTTP/1.0 when
it proxies requests to a back end server, and that means it opens a new
connection on every request rather than using a persistent connection"_

Have you tried using upstream keepalive?
[http://nginx.org/en/docs/http/ngx_http_upstream_module.html#...](http://nginx.org/en/docs/http/ngx_http_upstream_module.html#keepalive)
This should help keep down the number of connections, and thus ephemeral port
usage and TCP memory load.
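
Per those docs, the configuration looks roughly like this (proxy_http_version
1.1 and a cleared Connection header are required for upstream keepalive to
work):

    upstream backend {
        server 127.0.0.1:8080;
        # pool of idle keepalive connections kept per worker
        keepalive 32;
    }
    server {
        location / {
            proxy_pass http://backend;
            # HTTP/1.1 plus an empty Connection header enables reuse
            proxy_http_version 1.1;
            proxy_set_header Connection "";
        }
    }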

As for node.js, core only holds a connection open for one pass through the
event loop, and even then only if there are requests queued. If you have any
kind of high-volume TCP client in node, this will also cause issues with
ephemeral port exhaustion and thus TCP memory load. Check out
<https://github.com/TBEDP/agentkeepalive> in that case (sketched below). On
TCP memory load issues in general, this is a helpful paper:
<http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/>
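
A minimal sketch of plugging agentkeepalive in (the constructor option here
is limited to maxSockets, which is inherited from http.Agent; other option
names vary between versions, so consult the README):

    var http = require('http');
    var Agent = require('agentkeepalive');

    // cap concurrent sockets per host and reuse idle ones
    var keepaliveAgent = new Agent({ maxSockets: 100 });

    http.get({ host: 'localhost', port: 8080, path: '/',
               agent: keepaliveAgent }, function (res) {
      res.resume(); // drain the response so the socket can be reused
    });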

------
aidos
There's some good info in here. We ran a flash hotel sale a while back. It
only lasted 60 seconds, but ran at about 800 booking requests/second. We
discovered many of the same issues, but I never quite got iptables stable
(hash table flooding, then other issues), so I ended up getting it to ignore
some of the traffic. I'll try out the solutions in here next time to see how
it goes.

------
dfc
Does anyone know why nf_conntrack_tcp_timeout_established is set to such a
high value? Five days seems like an awful long time.

~~~
ck2
Yeah that was a good take-away from the article. I was not aware of that
either.

I would guess it's to allow long connections for ssh or similar without
timeouts, but there are other ways to prevent timeouts without it eating all
those resources.

I set it to 1800 myself; we'll see how that goes.

I noticed memory use went down after that.
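
For anyone wanting to do the same, a sketch (on recent kernels the sysctl
lives under net.netfilter):

    # expire established-connection conntrack entries after 30 minutes
    # (the default of 432000 seconds is the five days mentioned above)
    net.netfilter.nf_conntrack_tcp_timeout_established = 1800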

------
ck2
I've found the low-hanging fruit is adding _initcwnd 10_ to your ip route and
setting _tcp_slow_start_after_idle=0_.
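
Something along these lines (the gateway and interface are hypothetical; run
ip route show first to find your actual default route):

    # re-add the default route with a larger initial congestion window
    ip route change default via 192.168.1.1 dev eth0 initcwnd 10
    # don't collapse the congestion window after an idle period
    sysctl -w net.ipv4.tcp_slow_start_after_idle=0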

~~~
bluesmoon
that helps reduce latency, but if you can't accept new connections, it's of no
use.

Also note that initcwnd is set to 10 by default on all current OSen.

~~~
ck2
CentOS is used on a great percentage of webservers and it does not set
initcwnd to 10 by default.

Sadly we have to wait even longer for _initrwnd_ support (minimum 2.6.38
kernel)

~~~
harshreality
ELRepo has up-to-date kernels for CentOS.

------
koenigdavidmj
Do people seriously run out of ephemeral ports before they run out of server
memory?

~~~
ajross
There are ~32k ephemeral ports, and typical servers have ~32G of memory, which
works out to roughly 1MB of memory per available port. It's certainly not hard
to imagine a request architecture where a single request is handled in less
than 1MB of per-request memory, in which case you run out of ports before you
run out of RAM.
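
You can check the range on your own box; stock kernels of this era typically
print 32768 61000, i.e. roughly 28k usable ports:

    cat /proc/sys/net/ipv4/ip_local_port_range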

~~~
koenigdavidmj
Yes, but you don't need a separate port for every request. Is it reasonable to
believe that a single client will have that many connections open?

~~~
charliesome
In this case, they were reverse proxying with an old nginx that only supported
HTTP/1.0 for backend connections. So they do need an ephemeral port for each
request.

