Linux TCP/IP Tuning (lognormal.com)
162 points by bluesmoon on Sept 27, 2012 | 31 comments

At last! In the past we had to recompile the FreeBSD TCP stack on the nginx frontend to change constants such as the hardcoded first SYN timeout.

yep, unfortunately it will be a while before it gets into apt

If you're running nginx in production on Ubuntu or Debian, you'd probably be well served by using nginx's own packages rather than the distribution's default ones.


thanks, will look into that

In order for a collision to take place, we’d have to get a new connection from an existing client, AND that client would have to use the same port number that it used for the earlier connection, AND our server would have to assign the same port number to this connection as it did before.

Ephemeral ports aren't assigned to inbound connections, they're used for outbound connections. So, for the client-to-nginx connection, both the server IP and port are fixed (the port will be either 80 or 443) - only the client IP and port change, so for a collision all you need is for a client to re-use the same port on its side quickly.

For the nginx to node connection, both IPs and the server port are fixed, leaving only the ephemeral port used by nginx to vary. You don't have to worry about out-of-order packets here though, since the connection is loopback.

Note that only the side of the connection that initiates the close goes into TIME_WAIT - the other side goes into a much shorter LAST_ACK state.
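
To make the four-tuple argument concrete, here is a minimal Python sketch. The addresses and ports are made-up documentation values, and `Conn`, `in_time_wait` and `collides` are hypothetical names for illustration, not anything the kernel exposes. It only models the point above: with the server IP and port fixed, a collision depends entirely on the client's IP and port.

```python
from typing import NamedTuple


class Conn(NamedTuple):
    """A TCP connection is identified by this four-tuple."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


# Hypothetical server endpoint: fixed for every inbound connection.
SERVER_IP, SERVER_PORT = "203.0.113.10", 80

# One old connection still lingering in TIME_WAIT on the server.
in_time_wait = {Conn("198.51.100.7", 40001, SERVER_IP, SERVER_PORT)}


def collides(client_ip: str, client_port: int) -> bool:
    """True if a new inbound connection matches a lingering four-tuple."""
    return Conn(client_ip, client_port, SERVER_IP, SERVER_PORT) in in_time_wait


print(collides("198.51.100.7", 40001))  # same client IP and same client port
print(collides("198.51.100.7", 40002))  # same client, different port
print(collides("198.51.100.8", 40001))  # different client, same port
```

Only the first call reports a collision; the server never picks a port of its own for an inbound connection.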

For the nginx to node connection, it's a unix socket. No issue.

So just to be sure, some noob questions: are the 'Ephemeral Ports' and 'TIME_WAIT state' tricks there to handle the connections from nginx to Node.js (not from the client to nginx)?

Sockets from the client to nginx are uniquely identified by the client IP and the client port. On each client request, does nginx create a new socket to node.js?

Can there be more than one node.js instance running? Is that the main goal of nginx here, or are there additional benefits? > Edit, ok: "nginx is used for almost everything: gzip encoding, static file serving, HTTP caching, SSL handling, load balancing and spoon feeding clients" http://blog.argteam.com/coding/hardening-node-js-for-product...

Really surprised this article doesn't mention tcp_tw_reuse or tcp_tw_recycle. These have a more substantial impact than simply adjusting TW, as those ports will still be in a FIN_WAIT state for a long time before reuse as well.
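
Before touching either setting, it may help to measure how many sockets are actually sitting in TIME_WAIT or FIN_WAIT. A rough Python sketch follows, assuming the Linux `/proc/net/tcp` text format, where the fourth column is a hex state code. The `count_states` helper and the `SAMPLE` text are made up for illustration.

```python
from collections import Counter

# State codes as they appear in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED", "02": "SYN_SENT", "03": "SYN_RECV",
    "04": "FIN_WAIT1", "05": "FIN_WAIT2", "06": "TIME_WAIT",
    "07": "CLOSE", "08": "CLOSE_WAIT", "09": "LAST_ACK",
    "0A": "LISTEN", "0B": "CLOSING",
}

# Made-up sample in the shape of /proc/net/tcp (trailing columns trimmed).
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000
   1: 0100007F:1F90 0100007F:9C42 06 00000000:00000000
   2: 0100007F:1F90 0100007F:9C44 06 00000000:00000000
   3: 0100007F:1F90 0100007F:9C46 01 00000000:00000000
"""


def count_states(proc_net_tcp_text: str) -> Counter:
    """Count sockets per TCP state from /proc/net/tcp-style text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        code = fields[3].upper()
        counts[TCP_STATES.get(code, code)] += 1
    return counts


print(count_states(SAMPLE))
```

On a real machine you would pass the contents of `/proc/net/tcp` (and `/proc/net/tcp6`) instead of `SAMPLE`.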

Excellent article on the subject: http://www.speedguide.net/articles/linux-tweaking-121

Agreed. There are a lot of other possible optimizations, from the often-mentioned buffer size settings:

  net.core.rmem_max / net.core.wmem_max
  net.ipv4.tcp_rmem / net.ipv4.tcp_wmem

to metric tunings like:

  net.ipv4.tcp_no_metrics_save / net.ipv4.tcp_moderate_rcvbuf

I've been playing around with these settings on very loaded machines:

  # Retry SYN/ACK only three times, instead of five
  net.ipv4.tcp_synack_retries = 3
  # Try to close things only twice
  net.ipv4.tcp_orphan_retries = 2
  # FIN-WAIT-2 for only 5 seconds
  net.ipv4.tcp_fin_timeout = 5
  # Increase syn socket queue size (default: 512)
  net.ipv4.tcp_max_syn_backlog = 2048
  # One hour keepalive with fewer probes (default: 7200 & 9)
  net.ipv4.tcp_keepalive_time = 3600
  net.ipv4.tcp_keepalive_probes = 5
  # Max packets the input can queue
  net.core.netdev_max_backlog = 2500
  # Keep fragments for 15 sec (default: 30)
  net.ipv4.ipfrag_time = 15
  # Use H-TCP congestion control
  net.ipv4.tcp_congestion_control = htcp
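
For anyone who wants to compare a list like this against what a box is currently running, here is a small Python sketch. It parses sysctl.conf-style text into a dict and maps each key to its `/proc/sys` path. `parse_sysctl` and `proc_path` are hypothetical helper names, and the dot-to-slash mapping is a simplification that would not hold for keys containing an interface name with dots in it.

```python
def parse_sysctl(text: str) -> dict:
    """Parse 'key = value' lines, skipping blanks and # or ; comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key = value line
        settings[key.strip()] = value.strip()
    return settings


def proc_path(key: str) -> str:
    """Map a sysctl key to the file that exposes it under /proc/sys."""
    return "/proc/sys/" + key.replace(".", "/")


conf = """
  # FIN-WAIT-2 for only 5 seconds
  net.ipv4.tcp_fin_timeout = 5
  # Use H-TCP congestion control
  net.ipv4.tcp_congestion_control = htcp
"""

for key, wanted in parse_sysctl(conf).items():
    print(f"{proc_path(key)}: want {wanted}")
```

Reading each path and comparing it with the wanted value is left out so the sketch runs anywhere.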

have you noticed much of a change with htcp as the congestion control algo?

A year or two ago, someone posted to one of the Linux mailing lists demoing an IPv6 stack hang on 2.6.3x under high traffic when tcp_tw_recycle is enabled.

Be very careful and test it yourself.

edit: http://www.spinics.net/lists/netdev/msg154040.html

I stand to be corrected, but playing with tcp_tw_recycle may cause dropped frames with load balancing and NATs.

Scroll down to 'Networking' and read the notice.

tcp_tw_reuse and tcp_tw_recycle are dangerous. We have seen a significant number of connections from clients behind a NAT gateway being dropped with tcp_tw_recycle = 1.

Absolutely use with caution and test extensively. Every one of these tweaks tunes things at a very granular level and may cause more problems than it solves.

Btw, the dropped clients have to do with recycle; reuse is far 'safer', protocol-wise.

thanks for the link. the article doesn't talk about those settings because I'd never tried them.

_"A large part of this is due to the fact that nginx only uses HTTP/1.0 when it proxies requests to a back end server, and that means it opens a new connection on every request rather than using a persistent connection"_

Have you tried using upstream keepalive? http://nginx.org/en/docs/http/ngx_http_upstream_module.html#... This should help keep the number of connections down, and with it ephemeral port usage and TCP memory load.
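
To illustrate why keepalive helps with ports, here is a rough Python sketch. It is not nginx: it uses only the standard library to run a tiny HTTP/1.1 server on loopback and records which local port the client uses for each request. `Handler` and `client_ports` are made-up names for this example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"  # lets the server keep connections open

    def do_GET(self):
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def client_ports(n_requests: int, keepalive: bool) -> list:
    """Return the client-side port used for each of n_requests requests."""
    server = ThreadingHTTPServer(("127.0.0.1", 0), Handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    conn = http.client.HTTPConnection(host, port)
    ports = []
    try:
        for _ in range(n_requests):
            conn.request("GET", "/")  # reconnects if the socket was closed
            ports.append(conn.sock.getsockname()[1])
            conn.getresponse().read()
            if not keepalive:
                conn.close()  # one connection per request, HTTP/1.0 style
    finally:
        conn.close()
        server.shutdown()
        server.server_close()
    return ports


print(client_ports(5, keepalive=True))
print(client_ports(5, keepalive=False))
```

With keepalive every request goes out over the same socket, so the first list repeats a single port. Without it each request opens a fresh connection, and every closed one leaves a socket behind in TIME_WAIT.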

As for node.js, core only ever holds a connection open for one pass through the event loop, and even then, only if there are requests queued. If you have any kind of high volume tcp client in node, this will also cause issues w/ ephemeral port exhaustion and thus tcp memory loading. Check out https://github.com/TBEDP/agentkeepalive in that case. Related to tcp memory load issues in general, this is a helpful paper: http://www.isi.edu/touch/pubs/infocomm99/infocomm99-web/

There's some good info in here. We ran a flash hotel sale a while back. Only lasted for 60 seconds but with about 800 booking req/second. Discovered many of the same issues but I never quite got iptables stable (hash table flooding, then other issues) so I ended up getting it to ignore some of the traffic. Will try out the solutions in here next time to see how it goes.

Does anyone know why nf_conntrack_tcp_timeout_established is set to such a high value? Five days seems like an awful long time.

Yeah that was a good take-away from the article. I was not aware of that either.

I would guess it's to allow long connections for ssh or similar without timeouts, but there are other ways to prevent timeouts without it eating all those resources.

I set it to 1800 myself, we'll see how that goes.
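
For a sense of scale, a back-of-the-envelope Python sketch. The rate of one abandoned connection per second is purely hypothetical, and `stale_entries` is a made-up helper. It assumes an entry for a connection that disappears without a FIN or RST stays in the table until the established timeout expires.

```python
SECONDS_PER_DAY = 24 * 60 * 60

default_timeout = 5 * SECONDS_PER_DAY  # the "five days" mentioned above
tuned_timeout = 1800                   # the value chosen in this comment


def stale_entries(abandoned_per_second: float, timeout_seconds: int) -> float:
    """Steady-state count of entries that only leave the table by timing out."""
    return abandoned_per_second * timeout_seconds


print(default_timeout)                     # seconds in five days
print(default_timeout // tuned_timeout)    # how many times shorter 1800 s is
print(stale_entries(1, default_timeout))   # hypothetical: 1 abandoned conn/s
print(stale_entries(1, tuned_timeout))
```

Five days is 432000 seconds, so 1800 is 240 times shorter, and the number of lingering entries shrinks by the same factor.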

I noticed memory use went down after that.

I've found the low-hanging fruit is adding initcwnd 10 to your ip route and setting tcp_slow_start_after_idle=0.
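
A small Python sketch of the first half, assuming the usual approach of re-issuing the existing default route through `ip route change` with `initcwnd` appended. The `with_initcwnd` helper and the gateway and interface in the example are made up; the function only builds the command and leaves running it to you.

```python
def with_initcwnd(route_line: str, initcwnd: int = 10) -> list:
    """Build an `ip route change` command that sets initcwnd on a route.

    route_line is one line as printed by `ip route show`, for example
    "default via 192.0.2.1 dev eth0". Any existing initcwnd is replaced.
    """
    cleaned = []
    skip_next = False
    for token in route_line.split():
        if skip_next:
            skip_next = False  # drop the old initcwnd value
            continue
        if token == "initcwnd":
            skip_next = True
            continue
        cleaned.append(token)
    return ["ip", "route", "change", *cleaned, "initcwnd", str(initcwnd)]


# Hypothetical default route; take yours from `ip route show`.
print(" ".join(with_initcwnd("default via 192.0.2.1 dev eth0")))
```

The other half is a plain sysctl, net.ipv4.tcp_slow_start_after_idle = 0.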

that helps reduce latency, but if you can't accept new connections, it's of no use.

Also note that initcwnd is set to 10 by default on all current OSen.

CentOS is used on a great percentage of webservers and it does not set initcwnd to 10 by default.

Sadly we have to wait even longer for initrwnd support (minimum 2.6.38 kernel)

elrepo has up to date kernels for centos.

Do people seriously run out of ephemeral ports before they run out of server memory?

There are ~32k ephemeral ports. Typical servers have ~32G of memory. It's certainly not hard to imagine a request architecture where a single request can be handled in less than 1MB of per-request memory.

Yes, but you don't need a separate port for every request. Is it reasonable to believe that a single client will have that many conns open?

In this case, they were reverse proxying with an old nginx that only supported HTTP/1.0 for backend connections. So they do need an ephemeral port for each request.
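
Rough numbers for that case, as a Python sketch. The port count, memory size and per-request footprint are the round figures from this thread. The 60-second TIME_WAIT duration is an assumption (it is the usual Linux value), and the function names are made up.

```python
EPHEMERAL_PORTS = 32_000            # "~32k ephemeral ports"
TIME_WAIT_SECONDS = 60              # assumed TIME_WAIT duration
MEMORY_BYTES = 32 * 1024**3         # "~32G of memory"
PER_REQUEST_BYTES = 1 * 1024**2     # "less than 1MB" per request


def max_new_connections_per_second(ports: int, time_wait_seconds: int) -> float:
    """Sustained rate to one backend address if every closed connection
    ties up its port for the full TIME_WAIT period."""
    return ports / time_wait_seconds


def max_concurrent_by_memory(memory_bytes: int, per_request_bytes: int) -> int:
    """How many requests fit in memory at once."""
    return memory_bytes // per_request_bytes


print(round(max_new_connections_per_second(EPHEMERAL_PORTS, TIME_WAIT_SECONDS)))
print(max_concurrent_by_memory(MEMORY_BYTES, PER_REQUEST_BYTES))
```

Under these assumptions, one connection per request caps out at roughly 533 new connections per second to a single backend address, while memory would allow about 32768 requests in flight. So ports can run out first.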

so we've made a few changes since I'd initially done the ephemeral port tuning, the most important being switching to unix domain sockets rather than TCP. with that, we probably no longer need the ephemeral port setting.

Yes. In some circumstances.
