A typical CS student's view of the network is that "everything is so random that it averages out." More often than not, the exact opposite is the case, to the chagrin of your local network admin.
Network traffic tends to act as if everyone knew you were about to open YouTube and jumped on the network all at once. The statistical term is "self-similarity," and the Hurst parameter (H) measures how strongly a network's traffic is _not_ averaging out: values near 0.5 mean uncorrelated traffic that does smooth out when aggregated, while values approaching 1 mean long-range dependence and bursts at every timescale.
It might help to mention another place we see self-similarity: fractals.
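For the curious, here's a minimal sketch (my own toy code, not from the article) of one common way to estimate H, the aggregated-variance method: for a self-similar series the variance of block means scales as m^(2H-2) in the block size m, so the slope of a log-log fit gives H. Uncorrelated data should come out near 0.5:

```python
import math
import random

def hurst_aggregated_variance(series, block_sizes=(1, 2, 4, 8, 16, 32)):
    """Estimate H from the slope of log(var of block means) vs log(block size).

    For a self-similar process, Var[X^(m)] ~ m^(2H - 2), so a fitted
    slope s gives H = 1 + s/2.
    """
    xs, ys = [], []
    for m in block_sizes:
        n = len(series) // m
        means = [sum(series[i * m:(i + 1) * m]) / m for i in range(n)]
        mu = sum(means) / n
        var = sum((v - mu) ** 2 for v in means) / (n - 1)
        xs.append(math.log(m))
        ys.append(math.log(var))
    # Least-squares slope of the log-log plot.
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return 1 + slope / 2

random.seed(0)
iid = [random.random() for _ in range(4096)]  # "averages out" traffic
print(round(hurst_aggregated_variance(iid), 2))  # typically near 0.5
```

Real packet traces (the famous Bellcore Ethernet measurements, for instance) come out well above 0.5, which is exactly the burstiness the article is describing.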
This article breaks down exactly that situation in a cluster. Again, self-similar traffic patterns mean everyone tries to talk at once, their TCP stacks all back off randomly, and the aggregate throughput of the network collapses.
Unfortunately, the blog post says, "What’s the remedy? We don’t have a good remedy for this yet."
Sure we do. Please google some of the relevant terms for great articles on network traffic analysis and optimization. For example "self similar network traffic" and "hurst parameter." Even the CMU site linked from the blog post has a great writeup under the section "SOLUTIONS" :) 
Additionally, as a comment on the blog points out, larger buffers on routers can be a really _bad_ thing! Bufferbloat tends to hide the core issue behind larger latencies rather than solve it.
One of the main causes of incast is synchronised bursts and backoffs forcing senders into timeout. As the CMU paper points out, the 200ms minimum RTO is far too conservative for datacenter round-trip times, so a sender that times out sits idle for orders of magnitude longer than the network itself requires. Reducing it can go a long way toward mitigating the incast effect.
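To see why the 200ms floor hurts so much, here's a back-of-envelope toy model (my own numbers, purely illustrative): a short datacenter transfer that suffers a single timeout pays the full min-RTO, dwarfing the transfer itself.

```python
def completion_time_ms(transfer_ms, rtt_ms, timeouts, min_rto_ms):
    """Toy model: each timeout stalls the sender for the effective RTO,
    which cannot drop below the configured minimum."""
    rto = max(min_rto_ms, 2 * rtt_ms)
    return transfer_ms + timeouts * rto

# A 1 ms transfer on a 0.1 ms-RTT datacenter link, one timeout:
print(completion_time_ms(1.0, 0.1, 1, 200))  # 201.0 ms with the stock 200 ms floor
print(completion_time_ms(1.0, 0.1, 1, 1))    # 2.0 ms with a 1 ms floor
```

A 200x slowdown from a single lost packet is why reducing (or sidestepping) the timeout is the lever everyone reaches for.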
There's a project called "R2D2" at Stanford University that proposed adding a tiny shim layer that rapidly retransmits lost TCP segments. It does this transparently, hiding packet losses from TCP and thus preventing TCP from experiencing a timeout. You can read more about it here: http://sedcl.stanford.edu/files/r2d2.pdf. EDIT: The highlight (relevant to the blog post) is that it is a loadable kernel module requiring no changes to the TCP stack!
Disclaimer: I work in the same group, and I am familiar with R2D2.
The R2D2 work is pretty neat: different from a lot of other approaches I've seen. I'm excited to see how the FPGA implementation works!
Some comments:
1) Any significant change to the control algorithm requires careful analysis: the variance in throughput in the multi-client experiment looks interesting, though I don't know whether that is steady state. From the graphs, R2D2 suffers more with larger file sizes whereas TCP actually improves.
2) Real datacenters can have very different traffic patterns that can break some of the assumptions about bandwidth uniformity and latency, though it's harder for academics to tackle that.
3) If you are going down the path of TCP offload, you can presumably avoid the overhead of CPU interrupts/timer programming when reducing the RTO into microseconds :). I'd be interested to see how R2D2's algorithms/constants hold up when you're able to shrink the 3ms timer to microseconds in hardware!
Also, if some kernel programmer wants to fix my once-working patch to support microsecond-granularity TCP retransmissions, I personally know a bunch of people who would be happy :)
Cisco 4948s are better positioned for top-of-rack and are more comparable to the Juniper EX4200s that came out ahead in the Tolly Report linked in the post.
I don't know anything about Erlang or the software they're dealing with, but I'm surprised that "program your software to avoid unnecessary head-of-line blocking" didn't make the list.
From what I've read, I think you could create a separate process dedicated to sending stuff to B, and have other processes deal with other clients. It's probably not possible, but it would also be nice if you could detect these hung processes and start more workers to deal with other clients, each potentially handling more than one client. (Yes, I may be totally wrong.)
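I can't speak for Erlang either, but the per-client-worker idea can be sketched in Python with asyncio (names and timings made up for illustration): give each client its own queue and sender task, so a slow client stalls only its own queue instead of head-of-line blocking everyone else.

```python
import asyncio

async def client_writer(name, queue, send_delay, log):
    # Dedicated sender per client: a slow client only stalls its own queue.
    while True:
        msg = await queue.get()
        if msg is None:  # sentinel: no more messages for this client
            break
        await asyncio.sleep(send_delay)  # simulated send; delay models a slow receiver
        log.append((name, msg))

async def main():
    log = []
    queues = {"A": asyncio.Queue(), "B": asyncio.Queue()}
    # B is slow (50 ms per send); with per-client workers it can't block A.
    workers = [
        asyncio.create_task(client_writer("A", queues["A"], 0.001, log)),
        asyncio.create_task(client_writer("B", queues["B"], 0.05, log)),
    ]
    for i in range(3):
        queues["A"].put_nowait(f"a{i}")
        queues["B"].put_nowait(f"b{i}")
    for q in queues.values():
        q.put_nowait(None)
    await asyncio.gather(*workers)
    return log

log = asyncio.run(main())
print(log)  # A's three messages complete without waiting behind B's slow sends
```

With a single shared send loop, A's messages would queue up behind every one of B's 50 ms sends; with one worker per client they don't.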
Look at rto_min, rto_max, rto_initial.
You can do something like:
ip route replace dev eth0 rto_min 10ms