

Lessons learned tuning TCP and Nginx in EC2 - jlintz
http://engineering.chartbeat.com/2014/01/02/part-1-lessons-learned-tuning-tcp-and-nginx-in-ec2/

======
colmmacc
Disclaimer: I work on Amazon Route 53 and Elastic Load Balancer.

From the article ... "Unfortunately there are a large number of (misbehaving)
DNS servers out there that don’t properly obey TTLs on records and will still
serve up stale records for an indefinite amount of time."

I would push back on how large this number is. In general, whenever we
experiment with DNS weights, we see about 98% of browser clients honouring the
change, and 99% within 5 minutes. But with mobile networks and Java clients,
things can be different. Mobile networks commonly have very few resolvers, and
so few answers in the mix to distribute load, and some versions of Java cache
answers forever by default.

Here's the hack we use to help with these situations! It helps when you
control the client, and it's something we've worked on with some mobile app
authors. With Route 53 (and hopefully Dyn too, I'm not sure), you can
configure a wildcard name to be a series of weighted entries, backed by health
checks. So instead of:

    
    
       ping.chartbeat.net weight=1 answer=192.0.2.1 healthcheck=111
       ping.chartbeat.net weight=1 answer=192.0.2.2 healthcheck=222
    

it can be configured as:

    
    
       *.ping.chartbeat.net weight=1 answer=192.0.2.1 healthcheck=111
       *.ping.chartbeat.net weight=1 answer=192.0.2.2 healthcheck=222
    

so it's pretty much the same, but then you have the client look up:

    
    
       [ some random nonce / guid ].ping.chartbeat.net
    

and voilà - you have busted any intermediate cache, and load is also spread
more evenly (there are usually many more clients than DNS resolvers).
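The client-side half of this trick is tiny. A minimal sketch in Python (the base domain here is taken from the example above; the use of a UUID as the nonce is just one convenient choice):

```python
import uuid

def nonce_hostname(base="ping.chartbeat.net"):
    """Build a hostname no intermediate DNS cache has seen before.

    Because the authoritative zone answers for *.ping.chartbeat.net,
    every fresh nonce forces the resolver back to the authoritative
    servers, bypassing any stale cached record for the base name.
    """
    return "%s.%s" % (uuid.uuid4().hex, base)

# The client then resolves the result as usual, e.g.:
#   socket.gethostbyname(nonce_hostname())
print(nonce_hostname())
```

Each call yields a different name, so two clients never share a cache entry, which is also what spreads load more evenly across the weighted answers.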

Self-promotion: If you do choose to use ELB and Route 53, we also support
wildcard ALIASes to ELBs, and the queries are handled free of charge.

~~~
jlintz
Interesting solution! I wasn't aware Route 53 handles those types of requests
free of charge; that certainly makes it a cost-effective solution if you're on
Route 53. Thanks for the response.

------
donavanm
Instead of netstat(8) or ss(8) check out /proc/net/sockstat and
/proc/net/netstat and /proc/net/tcp. Might as well save a fork and some
context switches.
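Reading those files directly is straightforward. A sketch of parsing /proc/net/sockstat (field layout as on typical Linux kernels, where each line is a protocol name followed by alternating counter-name/value pairs; verify against your own kernel):

```python
def parse_sockstat(text):
    """Parse /proc/net/sockstat-style text into {proto: {field: int}}.

    Lines look like: 'TCP: inuse 5 orphan 0 tw 12 alloc 6 mem 1'
    """
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        proto, rest = line.split(":", 1)
        fields = rest.split()
        # fields alternate name, value, name, value, ...
        stats[proto] = {fields[i]: int(fields[i + 1])
                        for i in range(0, len(fields) - 1, 2)}
    return stats

# e.g. parse_sockstat(open("/proc/net/sockstat").read())["TCP"]["tw"]
```

No fork, no exec; just one read(2) per sample, which matters if you poll these counters frequently.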

    
    
      net.ipv4.tcp_rmem=8192 873800 8388608
      net.ipv4.tcp_wmem=4096 655360 8388608
      net.ipv4.tcp_mem=8388608 8388608 8388608
    

You may want to rethink this. Your default values would support initial send
and receive windows of 400 & 600 packets; I've never seen initial windows that
high in the wild. If it's a client you've seen recently they should be in the
peer cache already. With this default receive allocation you only get 39,000
sockets max. And once you exceed tcp_mem high your sockets will be force
closed with a RST sent to the other side. Much better to have 'pressure' kick
in and limit the buffers, throttling the send & receive windows.
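The arithmetic behind that socket limit, assuming the usual 4 KiB page size (tcp_mem is counted in pages, tcp_rmem in bytes):

```python
PAGE = 4096            # bytes per page (typical on x86)
tcp_mem_high = 8388608  # pages, from net.ipv4.tcp_mem above
rmem_default = 873800   # bytes, middle value of net.ipv4.tcp_rmem above

pool_bytes = tcp_mem_high * PAGE          # total TCP memory pool, ~32 GiB
sockets_max = pool_bytes // rmem_default  # each socket's default rcv buffer
print(sockets_max)  # roughly 39,000 sockets before tcp_mem high is hit
```

And that counts receive buffers only; send buffers eat into the same pool, so the real ceiling is lower still.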

Go look at 'mem' in sockstat. I'd guess your average utilization is more in
the 50kB range. And that includes both send and receive and the tcp_info
structs, IIRC.

    
    
      net.ipv4.tcp_max_orphans=262144
    

That seems incredibly high; I'd expect more in the ~5,000 range on a very busy
host. Check your 'orphans' count in sockstat.

    
    
      net.core.netdev_max_backlog = 16384

From the source comments this is actually a per-CPU packet backlog; I haven't
verified the implementation though.

    
    
      net.ipv4.tcp_max_tw_buckets=6000000

You may not need to do this. sysctl_max_tw_buckets limits the number of
entries in the TIME_WAIT queue. When a socket moves to TIME_WAIT and the list
is full it will instead go directly to CLOSE. Not very polite, and it's
_possible_ you fail to retransmit data, but IMHO a low-risk scenario. See what
level you're actually running at in sockstat.

    
    
      tcp_tw_recycle

The worst sysctl name ever. The useful part is setting the TIME_WAIT timer to
the socket RTO instead of TCP_TIMEWAIT_LEN (60 seconds). The terrible behavior
is in tcp_v4_conn_request() of tcp_ipv4.c. The sysctl also enables strict
timestamp & sequence checking on SYNs. If peers behind a NAT device have
clocks > 1 second apart their SYNs will be silently dropped. IIRC PawsPassive
from /proc/net/netstat will be incremented for each drop.
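To watch for those drops, /proc/net/netstat can be read directly too; its format is pairs of lines per protocol extension, a header line of counter names followed by a line of values. A sketch (the exact counter name, commonly spelled PAWSPassive in the TcpExt line, varies by kernel version, so check yours):

```python
def parse_proc_netstat(text):
    """Parse /proc/net/netstat: alternating 'Proto: name name ...'
    header lines and 'Proto: val val ...' value lines."""
    lines = text.splitlines()
    counters = {}
    # Headers sit on even lines, the matching values on the next line.
    for header, values in zip(lines[::2], lines[1::2]):
        proto, names = header.split(":", 1)
        _, vals = values.split(":", 1)
        counters[proto] = dict(zip(names.split(),
                                   (int(v) for v in vals.split())))
    return counters

# e.g. parse_proc_netstat(open("/proc/net/netstat").read())
#          ["TcpExt"].get("PAWSPassive")
```

A counter that climbs while tcp_tw_recycle is on is a strong hint you're dropping SYNs from clients behind NAT.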

    
    
      tcp_tw_reuse

See tcp_twsk_unique() in tcp_ipv4.c. IIRC when you request a new ephemeral
socket it's checked against the timewait socket list. If sysctl_tcp_tw_reuse
is set and the TIME_WAIT socket is older than one second it can be reused.
Normally TIME_WAIT sockets are aged out of the queue after TCP_TIMEWAIT_LEN,
~60 seconds.

On TIME_WAIT in general you should probably look into setting the Maximum
Segment Lifetime to a more reasonable value than 60s. You want to cover your
max client RTO + one or two retrans. IMO something like 10s may be too short,
but I can't imagine 30s not working splendidly. See TCP_TIMEWAIT_LEN,
TCP_PAWS_MSL, and whatever other header values I'm missing.

~~~
jlintz
Great info, thanks! Going to look into these.

