- What field is this presentation about? Ok, TCP congestion control. What is congestion control?
- Why is congestion control important?
- What (types of) solutions exist?
- Why do we want to obtain a census of the solutions in the field?
- How have prior ones been obtained?
- How are IETF decisions guided/impacted by these results?
I understand that these introductions are often annoying to write (and they are not appropriate for a presentation held among peers that all mostly know the answers to the above), but maybe someone could provide a quick summary so that laypeople like me can approach this publication? Or maybe there actually is an associated paper?
If you want some context answering your first three questions, the subject is TCP congestion control, and there's a Wikipedia article about it (https://en.wikipedia.org/wiki/TCP_congestion_control). I don't know the answers to your last three questions, since I haven't been following the field; this presentation was mostly targeted at those who are following the field.
Basically, this is a really hard problem: you can measure bandwidth only indirectly (and with much delay), and the changes you make will themselves affect the measurements. Solutions exist in the form of reacting to packet loss (ACKs don't come back, or selective ACKs come back to explicitly tell you something was missing), to ECN (routers along the path setting bits in your header to explicitly tell you bandwidth is getting tight), and to changes in delay (as measured by ACK timing and embedded TCP timestamps; higher delay typically means you're building up a queue somewhere, which is bad). They also differ in what rate they start at (initial window), how aggressively they increase the rate when things appear to go well (exponential, linear, binary search), and how they decrease the rate when they get negative signals (cut in half, other tactics). An important factor is “fairness”, usually relative to TCP Reno (a classic congestion control algorithm from the 80s): if you and another sender transmit at the same time down the same path and they use some other congestion control algorithm, you should each get about half of the bandwidth. (An algorithm that's too aggressive could easily cause the other side to lose packets and keep decreasing its bandwidth estimate, crowding it off the link entirely.)
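The increase/decrease tactics above (linear increase, cut in half) are the classic AIMD pattern; here's a toy simulation of the resulting sawtooth (Python; the capacity value and the one-step-per-RTT model are made-up simplifications, not a real TCP):

```python
def aimd(capacity=100, rtts=1000):
    """Additive-increase / multiplicative-decrease, one step per RTT."""
    cwnd = 1.0
    delivered = []
    for _ in range(rtts):
        if cwnd > capacity:   # overshot the link: pretend a packet dropped
            cwnd /= 2         # multiplicative decrease (cut in half)
        else:
            cwnd += 1         # additive increase (one segment per RTT)
        delivered.append(min(cwnd, capacity))
    return sum(delivered) / len(delivered)

# the sawtooth oscillates between roughly capacity/2 and capacity,
# so average utilization lands somewhere around 75% of the link
print(aimd())
```

The halving is what keeps competing flows fair (they all back off by the same fraction); the utilization gap it leaves is what later algorithms try to close.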
A perfect congestion algorithm would immediately detect the correct amount of bandwidth (even in the presence of changing conditions and non-congestion-related packet loss such as a bad cable or Wi-Fi errors), send evenly at that rate, never send so much that packets are lost to congestion, never send so much that routers' queues build up and latency increases (bufferbloat) and be perfectly fair to all other algorithms.
As for census; it's interesting to know what exists out there, what people are doing in practice, and what algorithms we need to compete against and be fair to.
In TCP, a receiver can lie and make the sender flood the route by acknowledging packets it never actually received. Protocols that allow requesting retransmission of ranges make this loophole exploitable, but it does not seem to be abused often.
When you can trust the receiver implicitly, you can accept reports of end-to-end forward packet transmission time from it, and feed those to a predictor. When the time is increasing, the sending rate is too high; when decreasing, the rate could be higher. This allows ignoring packet loss as a signal, which is good because it is very noisy; it is caused by many things other than congestion.
Given a predictor, control theory provides easy means to tune sending rate exactly to the instantaneous capacity of the channel, and vary as other traffic begins and ends.
(Interestingly, round-trip time does not work for this. The channel carrying ACKs has completely different characteristics, so RTT is nearly useless. That does not stop people from trying!)
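That controller idea can be sketched crudely (Python; the function name, step size, and delay samples are all made up, and a real predictor would smooth over many measurements rather than eyeballing a three-sample trend):

```python
def adjust_rate(rate, delay_samples, step=0.05):
    """delay_samples: recent one-way delays reported by a trusted receiver."""
    if len(delay_samples) < 2:
        return rate
    trend = delay_samples[-1] - delay_samples[0]  # crude delay gradient
    if trend > 0:        # delay rising: a queue is building, back off
        return rate * (1 - step)
    if trend < 0:        # delay falling: queue draining, probe for more
        return rate * (1 + step)
    return rate          # flat delay: hold steady

rate = 100.0
for window in ([10, 10, 12], [12, 11, 10], [10, 10, 10]):
    rate = adjust_rate(rate, window)
print(rate)
```

Note that packet loss never appears here: the controller steers purely on the forward-delay signal.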
TCP at its best fails to use most of the channel when RTT gets long. It is possible to adjust the sending rate in real time to consume exactly all of the usable bandwidth that TCP does not, so that TCP sees no effect from the better-controlled traffic and is only slowed by sharing with other TCP streams.
The above is the basis for the Aspera file-transfer system, for which IBM still gets huge license fees from military, medical, extraction, and media customers.
What do you mean? My ISP knows which package I bought and how congested the tubes are, and how full the datatanks are.
Tracing route to news.ycombinator.com [126.96.36.199]
over a maximum of 30 hops:
1 1 ms 1 ms <1 ms 192.168.86.1
2 12 ms 11 ms 6 ms x.x.x.x
3 29 ms 32 ms 24 ms y.y.y.y
4 19 ms 21 ms 22 ms be23.clmkohpe02r.midwest.rr.com [188.8.131.52]
5 28 ms 22 ms 22 ms be27.clmkohpe01r.midwest.rr.com [184.108.40.206]
6 19 ms 23 ms 19 ms bu-ether31-vinnva0510w-bcr00.tbone.rr.com [220.127.116.11]
7 33 ms 31 ms 20 ms ae-13.edge2.Chicago10.Level3.net [18.104.22.168]
8 68 ms 69 ms 69 ms ae-0-11.bar1.SanDiego1.Level3.net [22.214.171.124]
9 71 ms 75 ms 69 ms M5-HOSTING.bar1.SanDiego1.Level3.net [126.96.36.199]
10 69 ms 77 ms 66 ms news.ycombinator.com [188.8.131.52]
Hop 2 is my router to my ISP
Hop 3, 4, 5, 6 are all routers within my ISPs network.
Hop 7 is from my ISP into "the internet". Level3 is a major provider of interconnectivity.
Hops 8 and 9 are within Level3's network.
Hop 10 is finally to Hacker News, which indicates HN's "ISP" is Level3 directly.
Only hop 2 is where my "400 Mbps" package limitation is in place.
Only hop 10 is "server-to-their-ISP".
I chose 10-20% as my intuition because, as you see, 10 routers were involved to get my packet from my PC to HN, and only 1 of them is constrained by my internet package.
news.ycombinator.com is no exception; my traceroute to it doesn't complete either:
5 a9k2-49-infra-dc2.dc3.poneytelecom.eu (184.108.40.206) 0.745 ms 220.127.116.11 (18.104.22.168) 0.865 ms a9k2-49-infra-dc2.dc3.poneytelecom.eu (22.214.171.124) 0.720 ms
6 lag-110.ear3.Paris1.Level3.net (126.96.36.199) 0.940 ms ae53.edge7.Paris1.Level3.net (188.8.131.52) 9.700 ms lag-110.ear3.Paris1.Level3.net (184.108.40.206) 0.955 ms
7 ae-0-11.bar1.SanDiego1.Level3.net (220.127.116.11) 139.415 ms 139.085 ms 139.031 ms
8 GIGLINX-INC.bar1.SanDiego1.Level3.net (18.104.22.168) 150.362 ms M5-HOSTING.bar1.SanDiego1.Level3.net (22.214.171.124) 140.286 ms 139.842 ms
9 * * *
10 * * *
11 * * *
12 * * *
13 * * *
By experimenting a bit I found that when sending UDP packets the traceroute is incomplete, but when sending ICMP packets my traceroute completes too. That explains how the grandparent could present a full traceroute.
I had just taken it as a fact that full traceroutes are rarely possible these days. For reasons like you say: it's overhead for them, and they don't want you to do it.
6: ae65.edge1.Stockholm1.Level3.net 35.147ms asymm 7
7: ae-0-11.bar1.SanDiego1.Level3.net 457.605ms asymm 16
8: M5-HOSTING.bar1.SanDiego1.Level3.net 199.025ms asymm 17
9: no reply
10: no reply
11: no reply
12: no reply
Not working the field, but I believe it's an incredibly hard statistical model, not that far away from random chaos for a remote observer in the general case.
I don’t see the point in every presenter reinventing (probably poorly) the same “bringing you up to speed on core technology xxxx” in every presentation.
> they are not appropriate for a presentation held among peers that all mostly know the answers to the above
My comment was merely requesting that someone else provides such a summary, so that the linked material becomes accessible to a wider audience. I expect that the majority of readers here are not particularly familiar with this problem/field, so I thought it would be beneficial to have such a complementary explanation in the thread. (Yes, you can tell people to spend 10 minutes reading through Wikipedia and other supplementary material, but few will do so.) And indeed, I could get significantly more value out of this submission after reading some of the explanations I solicited. Others hopefully/presumably had the same experience (instead of just having to move on).
I didn't want to comment on the downvotes I received, but I would appreciate suggestions on how my wording could have been improved to avoid the reaction that you had. Avoiding that reaction was precisely what I intended with the quoted remark, but it seems it was not very effective.
Your first paragraph ("sorely missing" "required to have") reads like a criticism. The first sentence of the last paragraph ("annoying to write") sounds like "I know you don't always enjoy eating vegetables, but...".
If you meant to just ask for a summary as a non-specialist, next time just say "Could someone explain the context of this to a layperson like me? Specifically, ..."
the internet has no flow control - any sender can issue as many packets as their egress link will bear. since the subsequent links are often oversubscribed, the sum of the input rates can exceed a link's output rate, and some packets may need to be dropped.
for good or ill - the solution is to put a controller on the sender that examines the traffic statistics and tries to adapt its sending rate so as not to overflow the network capacity.
since everyone is doing this, we also kind of want to make sure that everyone sharing an output link adapts to something close to a fraction of that bandwidth divided by the total number of TCP flows. this is fairness.
without getting into the highly involved details you can already see that this is a really interesting control/game theory problem.
I find it particularly weird, astounding even, that this situation is really quite ideal for a tragedy of the commons. except there is no tragedy. everyone can cheat, but almost no one does. sure - you'd have to recompile your kernel, but anyone could ship 'network turbocharger 2.0' that forces all the other TCPs on the network to back off and allocate as much intermediate bandwidth as you like.
a very easy way to do this as Netscape discovered is to use multiple connections. the pieces are all the same size, but you can take as many as you like.
obviously that situation could degenerate very quickly. but it hasn't. and it leaves the designers of wide area services that don't just use one TCP connection quite at a loss about how much bandwidth they should take.
as far as the survey goes, I can think of all sorts of reasons why the IETF would care. one reason is that uptake of new work is pretty hard - and has killed lots of great protocol work. I'm pretty surprised that BBR has such rapid uptake - the technical situation remains a little unclear.
for me the interesting part is the discussion of the measurement technique
Try running a site with big files for downloads. Every week or so, you'll be visited by someone who runs 200+ simultaneous connections for the same file, keeps opening more, and ignores HTTP errors (like “too busy”).
There are also the “IW wars”, where CDNs compete by increasing their initial window more and more to gain an edge. This will be exponentially worse with QUIC, when tuning these knobs becomes more accessible to non-experts.
That's the point of "fair queuing". The idea is that if you send too much, you just compete with your own packets, and thus are not rewarded for sending too much. This discourages hogging the bandwidth. But fair queuing is computationally expensive, and is rarely seen in core routers, where some FPGA is doing routing. It's also not seen enough in home routers, where fast LAN hits slower uplink and a FIFO queue gets long, increasing latency. That's "bufferbloat". Quite fixable. Now get AT&T and Comcast to cooperate.
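The idea can be illustrated with a toy per-flow round-robin queue (Python; real implementations such as DRR or fq_codel are cheaper approximations of this, and all names below are invented):

```python
from collections import deque

class FairQueue:
    """Each flow gets its own queue; the link serves flows round-robin."""

    def __init__(self):
        self.flows = {}  # flow id -> deque of queued packets

    def enqueue(self, flow_id, packet):
        self.flows.setdefault(flow_id, deque()).append(packet)

    def dequeue_round(self):
        """Send one packet from each non-empty flow per round."""
        out = []
        for flow_id in list(self.flows):
            q = self.flows[flow_id]
            if q:
                out.append(q.popleft())
        return out

fq = FairQueue()
for i in range(5):
    fq.enqueue("hog", f"hog-{i}")   # one flow floods five packets
fq.enqueue("mouse", "mouse-0")      # another sends just one
print(fq.dequeue_round())           # the hog can't starve the mouse
```

The flooding flow only delays its own packets: the single-packet flow still gets served in the first round, which is exactly the incentive structure the comment describes.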
From the paper, "Classic congestion control questions on fairness guarantees and convergence need to be re-answered for the new rate-based congestion control paradigm." Yes. They're looking at performance for individual nodes.
This will be exponentially worse with QUIC
Or when Google uses it to improve their responsiveness vs. others?
As I mentioned in a sibling comment, BBR has high uptake among Google properties, but not necessarily much outside; many of the other big players are still using CUBIC, since it's tried and tested. Raw counts aren't a super useful metric, especially if many sibling domains (e.g. google.com, google.com.hk) can count toward that same number. That said, BBR is available as an option in the mainline Linux kernel, and I wouldn't be surprised to see at least a bit of cargo culting explaining its popularity ("it works for Google, so...").
(Alexa top 250 sites are no longer free to view, apparently, but the top 50 contain at least four google properties. Other website rankings: https://moz.com/top500 shows no fewer than 45 hits for "google", not to mention other properties such as Youtube or Blogger; https://trends.netcraft.com/topsites has 33 hits for "Google LLC" out of their top 100.)
I'm personally surprised by the widespread use of algorithms other than Cubic. Some of these are obviously in-house, e.g. Akamai CC, and there are some cases of, say, CTCP from Windows Server 2008 machines, but Cubic is the default on Linux, newer versions of Windows, and even macOS, which I assume covers a large majority of servers.
As you said, it's available in linux behind a simple sysctl and here's a cloudflare blog post describing just that. And netflix contributed the equivalent to freebsd.
And here's a blog post about AWS cloudfront seeing improvements from it. TFA also mentions Steam.
So it's more than just google.
Network congestion is a little bit like traffic congestion. There are a bunch of dissimilarities, but the important thing is that if there's a lot of congestion, then everything slows to a halt.
(My terminology will be a bit loose, since I assume laypeople aren't as interested in the nitty gritty details of the bandwidth-delay product, or the specifics of how congestion windows work.)
So, the primary job of congestion control is to determine how much capacity there is on the network, and to avoid going over that.
There are additional desirable properties, such as fairness (sharing the network evenly), or efficiency (sending as close to the capacity as possible), which other solutions might try to improve upon.
(For all of these, the core of the congestion control is in the steady state.)
Some well known options in the space:
- TCP (New) Reno: a series of improvements that use at their core the principle of "additive increase, multiplicative decrease (AIMD)". Basically, you're always probing the network by increasing slowly, but when you see a packet drop, assume that it's due to congestion, and immediately halve the amount of bandwidth you're using. This performs well in terms of fairness, but has poor efficiency characteristics (see: "tcp sawtooth" which in the limit might achieve efficiency of ~75%)
- TCP CUBIC: the premise behind CUBIC is to improve upon New Reno's efficiency. Instead of linearly climbing back up, what if we grew the window as a cubic function of the time since the last loss? Thus, we climb quickly at first, then slow down as we approach our previous sending rate (pre-drop), then speed up again once we pass that limit, in case the congestion has since cleared up.
- TCP BBR: stands for "Bottleneck Bandwidth and Round-trip propagation time" (not "bufferbloat reduction", though reducing bufferbloat is one of its goals). Bufferbloat is the increase in latency that tends to precede drops due to congestion, whereas, say, issues with a flaky wifi connection might not manifest in the same way. So, the goal is to detect congestion earlier than by relying on packet drops (which are jarring and not always due to congestion); BBR does this by explicitly estimating the path's bottleneck bandwidth and propagation delay. Previous attempts to use latency as an early detection signal, such as Vegas, had adoption issues around fairness: they would back off more quickly than, say, New Reno or CUBIC, and thus achieve lower throughput; BBR is more aggressive in some situations but less aggressive in others.
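CUBIC's probing curve mentioned above is just a cubic in the time since the last loss; here's a small numeric sketch (Python; C and beta are the defaults from RFC 8312, w_max is an arbitrary example value):

```python
def cubic_window(t, w_max, c=0.4, beta=0.7):
    """Window size t seconds after a loss cut the window to beta * w_max."""
    # k is the time at which the curve climbs back to the pre-loss maximum
    k = ((w_max * (1 - beta)) / c) ** (1 / 3)
    return c * (t - k) ** 3 + w_max

w_max = 100.0
for t in (0, 2, 4, 6, 8):
    print(t, round(cubic_window(t, w_max), 1))
```

The curve starts at 70 (the post-loss window), flattens out near 100 (the pre-loss rate), then accelerates past it to probe for newly freed capacity.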
Why do we want a census? So we can see what's out there, what adoption rates look like, and how widely newer protocols are deployed in the greater ecosystem. For instance, BBR is called out as serving >40% of downstream internet traffic, but over 39% of downstream traffic is from Google properties using BBR (unsurprising, given that BBR came out of Google).
How have prior reports been acquired? 2011 report: https://cse.unl.edu/~xu/research/TCPcensus.html
There's a lot of existing work around inferring TCP congestion control (especially how the congestion window evolves in reaction to loss), and I'm sure this survey made use of a fair amount of the existing literature.
(I don't know how IETF decisions have been impacted in the past, so I can't answer that one. But I imagine that this helps inspire further research in the field.)
Edit: Also thanks to the other replies that have pointed out the interesting challenges involved. Don't want to comment on each one individually.
I explain elsewhere in this thread the optimal algorithm, which is sadly not compatible with TCP's conditions.
From downloading the PDF and skimming the raw bytes, I see:
<pdf:Producer>Microsoft® PowerPoint® 2016</pdf:Producer></rdf:Description>
Yup, a quick google shows that it is actually a variant of the built-in "Integral" theme in Microsoft PowerPoint.
I suppose the lesson here is that talent matters more than the tool.
Sincere thanks for the research!
slide 27 and onwards
A survey of internet properties and how they each solve this problem for their own domains seems reasonable to consider as "serious detail."