If you're running nginx in production on Ubuntu or Debian, you'd probably be well-served by using nginx's own packages rather than the distribution default ones.
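For what it's worth, a rough sketch of what pulling in nginx's own repo looks like on Ubuntu - the key URL and deb line are taken from nginx.org's install docs, so double-check them there and swap in your release codename for "precise":

wget -qO - http://nginx.org/keys/nginx_signing.key | sudo apt-key add -
echo "deb http://nginx.org/packages/ubuntu/ precise nginx" | sudo tee /etc/apt/sources.list.d/nginx.list
sudo apt-get update && sudo apt-get install nginx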
In order for a collision to take place, we’d have to get a new connection from an existing client, AND that client would have to use the same port number that it used for the earlier connection, AND our server would have to assign the same port number to this connection as it did before.
Ephemeral ports aren't assigned to inbound connections; they're used for outbound connections. So, for the client-to-nginx connection, both the server IP and port are fixed (the port will be either 80 or 443) - only the client IP and port change, so for a collision all you need is for a client to re-use the same port on its side quickly.
For the nginx to node connection, both IPs and the server port are fixed, leaving only the ephemeral port used by nginx to vary. You don't have to worry about out-of-order packets here though, since the connection is loopback.
Note that only the side of the connection that initiates the close goes into TIME_WAIT - the other side goes into a much shorter LAST_ACK state.
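If you want to see how close you are to exhaustion, the two things to look at are the ephemeral port range and the number of sockets sitting in TIME_WAIT. A quick sketch (the widened range is just an example, not a recommendation):

# default range is usually 32768-61000, i.e. roughly 28k outbound ports
sysctl net.ipv4.ip_local_port_range
# widen it, e.g.:
sudo sysctl -w net.ipv4.ip_local_port_range="1024 65535"
# count sockets currently in TIME_WAIT
ss -tan state time-wait | wc -l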
So just to be sure, some noob questions: the 'ephemeral ports' and 'TIME_WAIT state' tricks are here to handle the connections from nginx to Node.js (not the client-to-nginx connections)?
Sockets from the client to nginx are identified by the client IP and the client port. On each client request, does nginx create a new socket to Node.js?
There can be more than one Node.js instance running? Is that the main goal of nginx here, or are there additional benefits?
> Edit, ok: "nginx is used for almost everything: gzip encoding, static file serving, HTTP caching, SSL handling, load balancing and spoon feeding clients" http://blog.argteam.com/coding/hardening-node-js-for-product...
Really surprised this article doesn't mention tcp_tw_reuse or tcp_tw_recycle. These have a more substantial impact than simply adjusting TIME_WAIT, as those ports will still be in a FIN_WAIT state for a long time before reuse as well.
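(For reference, those are plain sysctls too - but see the warnings further down before touching recycle:)

# reuse TIME_WAIT sockets for new outbound connections when safe from a protocol viewpoint
net.ipv4.tcp_tw_reuse = 1
# aggressively recycle TIME_WAIT sockets - known to break clients behind NAT
net.ipv4.tcp_tw_recycle = 1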
I've been playing around with these settings on very loaded machines:
# Retry SYN/ACK only three times, instead of five
net.ipv4.tcp_synack_retries = 3
# Try to close things only twice
net.ipv4.tcp_orphan_retries = 2
# FIN-WAIT-2 for only 5 seconds
net.ipv4.tcp_fin_timeout = 5
# Increase syn socket queue size (default: 512)
net.ipv4.tcp_max_syn_backlog = 2048
# One hour keepalive with fewer probes (default: 7200 & 9)
net.ipv4.tcp_keepalive_time = 3600
net.ipv4.tcp_keepalive_probes = 5
# Max packets the input can queue
net.core.netdev_max_backlog = 2500
# Keep fragments for 15 sec (default: 30)
net.ipv4.ipfrag_time = 15
# Use H-TCP congestion control
net.ipv4.tcp_congestion_control = htcp
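If anyone wants to try these, they go in /etc/sysctl.conf and get reloaded with sysctl -p, or you can set one on the fly to test:

# reload /etc/sysctl.conf
sudo sysctl -p
# or apply a single setting immediately
sudo sysctl -w net.ipv4.tcp_fin_timeout=5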
On 2.6.3x kernels, someone posted to one of the Linux mailing lists a year or two ago demonstrating an IPv6 stack hang under high traffic when tcp_tw_recycle is set to true.
tcp_tw_reuse and tcp_tw_recycle are dangerous. We have seen a significant number of connections from clients behind a NAT gateway being dropped with tcp_tw_recycle = 1.
Absolutely use with caution and test extensively. Every one of these tweaks is tuning things at a very granular level and may cause more problems than it helps.
Btw, the dropped clients have to do with recycle -- reuse is far 'safer', protocol-wise.
_"A large part of this is due to the fact that nginx only uses HTTP/1.0 when it proxies requests to a back end server, and that means it opens a new connection on every request rather than using a persistent connection"_
As for node.js, core only ever holds a connection open for one pass through the event loop, and even then, only if there are requests queued. If you have any kind of high-volume TCP client in node, this will also cause issues with ephemeral port exhaustion and thus TCP memory loading. Check out https://github.com/TBEDP/agentkeepalive in that case. Related to TCP memory load issues in general, this is a helpful paper: http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/
There's some good info in here. We ran a flash hotel sale a while back - it only lasted 60 seconds but peaked at about 800 booking req/second. Discovered many of the same issues, but I never quite got iptables stable (hash table flooding, then other issues), so I ended up getting it to ignore some of the traffic. Will try out the solutions in here next time to see how it goes.
Yeah that was a good take-away from the article. I was not aware of that either.
I would guess it's to allow long connections for ssh or similar without timeouts, but there are other ways to prevent timeouts without it eating all those resources.
There are ~32k ephemeral ports. Typical servers have ~32G of memory. It's certainly not hard to imagine a request architecture where a single request can be handled in less than 1MB of per-request memory.
In this case, they were reverse proxying with an old nginx that only supported HTTP/1.0 for backend connections. So they do need an ephemeral port for each request.
so we've made a few changes since I'd initially done the ephemeral port tuning, the most important being switching to unix domain sockets rather than TCP. with that, we probably no longer need the ephemeral port setting.
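For anyone curious, the unix-socket setup is just a different upstream target on the nginx side - something like the sketch below, with a made-up socket path that your node app would have to listen() on:

upstream node_app {
    # node must be listening on this same socket file
    server unix:/var/run/node_app.sock;
}

server {
    listen 80;
    location / {
        proxy_pass http://node_app;
    }
}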
No more:
http://www.quora.com/Why-doesnt-Nginx-talk-HTTP-1-1-to-upstr...
http://nginx.org/en/docs/http/ngx_http_upstream_module.html#...
http://mailman.nginx.org/pipermail/nginx/2011-August/028324....
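i.e. with a recent enough nginx you can hold persistent connections to the backend; per the upstream module docs linked above it looks roughly like this (the backend address and pool size are just examples):

upstream node_backend {
    server 127.0.0.1:3000;
    # pool of idle keepalive connections kept open per worker
    keepalive 64;
}

server {
    location / {
        proxy_pass http://node_backend;
        # speak HTTP/1.1 to the upstream and clear the Connection header
        # so nginx doesn't ask the backend to close after each request
        proxy_http_version 1.1;
        proxy_set_header Connection "";
    }
}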