

TCP incast: What is it? How can it affect Erlang applications? - hypnotist
http://www.snookles.com/slf-blog/2012/01/05/tcp-incast-what-is-it/

======
sounds
Computer Science includes research into networking. TCP incast is a way of
describing the statistical properties of network traffic, particularly in
data centers [1].

A typical CS student's view of the network is that "everything is so random
that it averages out." The exact opposite is actually the case most often, to
the chagrin of your local network admin.

Network traffic tends to act as if everyone knew you were going to YouTube
right now and jumped on the network all at once. The statistical term is
"self-similarity", and the Hurst parameter (H) measures how badly a network's
traffic is _not_ averaging out, but bursting as if everyone knew you were
going to YouTube.

It might help to mention another place we see self-similarity: fractals.

This article just breaks down the situation in a cluster. Again, self-similar
traffic patterns mean everyone tries to talk at once, and their TCP stacks all
back off randomly, so the total bandwidth of the network is poor.

Unfortunately, the blog post says, "What’s the remedy? We don’t have a good
remedy for this yet."

Sure we do. Please google some of the relevant terms for great articles on
network traffic analysis and optimization. For example "self similar network
traffic" and "hurst parameter." Even the CMU site linked from the blog post
has a great writeup under the section "SOLUTIONS" :) [2]

Additionally, as a comment on the blog points out, larger buffers on routers
can be a really _bad_ thing! Buffer bloat tends to hide core issues behind
larger latencies without solving them.

[1] <http://en.wikipedia.org/wiki/Long-tail_traffic>

[2] <http://www.pdl.cmu.edu/Incast/>

~~~
xtacy
That's right -- bufferbloat introduces more problems than it solves for incast.

One of the main reasons for incast is the synchronised bursts and backoffs
causing senders to time out. As the CMU paper pointed out, the 200ms min-RTO
is too conservative for senders to recover quickly from timeouts. Reducing it
can go a long way in mitigating the incast effect.
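To make the min-RTO point concrete, here is a minimal Python sketch of the standard RTO estimator from RFC 6298 with a configurable floor. The function name `update_rto` and the 1 ms clock granularity are assumptions for illustration; real TCP stacks (Linux included) differ in the details, so treat this as arithmetic, not as the kernel's code.

```python
def update_rto(srtt, rttvar, sample, min_rto=0.200, granularity=0.001):
    """One RFC 6298 style estimator update. All times are in seconds.

    srtt/rttvar are None before the first RTT sample.
    min_rto is the floor on the result (0.200 mirrors the 200ms discussed above).
    granularity is an assumed clock granularity (G in the RFC).
    """
    if srtt is None:
        # First measurement: SRTT = R, RTTVAR = R / 2
        srtt = sample
        rttvar = sample / 2
    else:
        # Later measurements: update RTTVAR first, using the old SRTT
        rttvar = 0.75 * rttvar + 0.25 * abs(srtt - sample)
        srtt = 0.875 * srtt + 0.125 * sample
    rto = srtt + max(granularity, 4 * rttvar)
    # The floor dominates when the measured RTT is tiny
    return srtt, rttvar, max(rto, min_rto)
```

With an assumed data-center RTT sample of 100 microseconds, the estimator itself wants a timeout of about a millisecond, but the 200ms floor wins, so a sender that loses the tail of a burst sits idle for orders of magnitude longer than a round trip.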

There's a project called "R2D2" at Stanford University that proposed adding a
tiny shim layer that rapidly retransmits lost TCP segments. It was done in a
transparent manner to hide packet losses from TCP thus preventing TCP from
experiencing a timeout. You can read more about it here:
<http://sedcl.stanford.edu/files/r2d2.pdf>. EDIT: The highlight (relevant to
the blog post) is that it is a loadable kernel module requiring no changes to
TCP stack!

Disclaimer: I work in the same group, and I am familiar with R2D2.

~~~
vrv
The degenerate/extreme/unrealistic case of Incast is you have switch buffer
capacity to store N segments, you talk to M servers that each return one
segment, and M >> N. Although RTO is calculated based on RTT and RTTVAR, in
the extreme case you can get clumps (and waves) of retransmissions (depending
on the properties of the network) such that even eliminating the minRTO
altogether may not solve the problem at some scale. In simulation we
experimented with adding an adaptive staggering to the exponential backoff
algorithm and found that it helped at high server counts [1], but it was only
simulation so I'd take that approach with a grain of salt.
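For readers who want to see what staggering a backoff looks like, here is a minimal Python sketch. It is a generic randomized variant, not necessarily the adaptive scheme evaluated in [1]; the function name `staggered_backoff`, the 0.5 jitter fraction, and the 60 second cap are all assumptions chosen for illustration.

```python
import random


def staggered_backoff(base_rto, attempt, rng=random, max_rto=60.0):
    """Delay in seconds before retransmission number `attempt` (0-based).

    Plain exponential backoff would return base_rto * 2**attempt, so
    senders that timed out together would also retry together.
    """
    backoff = min(base_rto * (2 ** attempt), max_rto)
    # Stretch the delay by a random factor in [1.0, 1.5) so that
    # synchronized senders spread their retransmissions out in time.
    return backoff * (1.0 + 0.5 * rng.random())
```

Passing a seeded `random.Random` instance as `rng` makes runs repeatable, which is handy when comparing against a simulation.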

The R2D2 work is pretty neat: different than a lot of other approaches I've
seen. I'm excited to see how the FPGA implementation works!

Some comments: 1) I think any significant change to the control algorithm
requires careful analysis: the variance in throughput with the multi-client
experiment looks interesting, though I don't know whether that is steady
state. From the graphs, R2D2 suffers more with larger filesizes whereas TCP
actually improves. 2) Real datacenters can have very different traffic
patterns that can break some of the assumptions about bandwidth uniformity and
latency, though it's harder for academics to tackle that. 3) If you are going
down the path of TCP offload, you presumably can avoid the overhead of CPU
interrupts/timer programming when reducing the RTO into microseconds :). I'd
be interested in seeing how R2D2's algorithms/constants work when you're able
to reduce the 3ms timer to microseconds in hardware!

Also, if some kernel programmer wants to fix my once-working patch to support
microsecond-granularity TCP retransmissions [2], I personally know a bunch of
people who would be happy :)

[1] <http://vijay.vasu.org/static/papers/sigcomm147-vasudevan.pdf>

[2] <https://github.com/vrv/linux-microsecondrto>

------
Dylan16807
Yikes, that switch is only buffering a third of a millisecond's worth of
packets. Easy to see why the connections would collapse.

~~~
jauer
Yeah. Cisco 3750s are notorious for having small buffers. They are more
appropriately used for connecting desktops in an office.

Cisco 4948s are better positioned for top of rack and are more comparable to
the Juniper EX4200s that were shown to be better in the Tolly Report shown in
the post.

------
ambrop7
> ... head-of-line blocking ... What’s the remedy?

I don't know anything about Erlang or the software they're dealing with, but
I'm surprised that "program your software to avoid unnecessary head-of-line
blocking" didn't make the list.

~~~
ricardobeat
It's in the post. Cmd + F "backpressure/feedback mechanism built in to Erlang"

~~~
ambrop7
Yeah, I get that it may be non-trivial to avoid head-of-line blocking in that
context, but I doubt that it's impossible. If it is impossible, maybe Erlang
just isn't the best tool for the job.

From what I've read, I think you could create a separate process dedicated to
sending stuff to B, and have other processes deal with other clients. It's
probably not possible, but it would also be nice if you could detect these
hung processes and start more workers to deal with other clients, each dealing
with potentially more than one client. (yes, I may be totally wrong)
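The "separate process dedicated to sending stuff to B" idea can be sketched outside Erlang too. Below is a minimal Python version using one queue and one worker thread per destination. The class name `PerDestinationSender` and the `send_fn(dest, message)` callback are hypothetical, and this says nothing about how the software in the blog post is actually structured.

```python
import queue
import threading


class PerDestinationSender:
    """One FIFO queue and one worker thread per destination, so a stalled
    destination only delays its own messages."""

    def __init__(self, send_fn):
        self._send_fn = send_fn      # hypothetical blocking send(dest, message)
        self._queues = {}
        self._lock = threading.Lock()

    def submit(self, dest, message):
        with self._lock:
            q = self._queues.get(dest)
            if q is None:
                # First message for this destination: create its queue and worker
                q = queue.Queue()
                self._queues[dest] = q
                threading.Thread(
                    target=self._drain, args=(dest, q), daemon=True
                ).start()
        q.put(message)

    def _drain(self, dest, q):
        while True:
            message = q.get()
            try:
                self._send_fn(dest, message)   # may block; only this worker waits
            finally:
                q.task_done()

    def wait_for(self, dest):
        """Block until everything submitted to `dest` so far has been sent."""
        self._queues[dest].join()
```

The queues here are unbounded, so a permanently hung destination will accumulate messages in memory; a real system would want a bound or some backpressure on top of this.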

------
darwinGod
Nice article, with good related references. An excellent read. Didn't know
about TCP incast before. Good way to start a Saturday morning!

------
zurn
Sounds like they're using the wrong kind of switches and/or haven't configured
Ethernet flow control correctly?

------
zobzu
The question might be, can we use iproute to change TCP's RTO in Linux yet?

~~~
mh-
yes.

<http://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt>

look at rto_min, rto_max, rto_initial.

you can do something like:

    
    
      ip route replace dev eth0 rto_min 10ms

~~~
zobzu
Thanks!

