The great Internet TCP congestion control census (ietf.org)
126 points by fanf2 4 months ago | hide | past | favorite | 51 comments

Reading through it, I can figure out some of the context for this research, but in this presentation I sorely miss the "introduction" section that a paper would be required to have.

- What field is this presentation about? Ok, TCP congestion control. What is congestion control?

- Why is congestion control important?

- What (types of) solutions exist?

- Why do we want to obtain a census of the solutions in the field?

- How have prior ones been obtained?

- How are IETF decisions guided/impacted by these results?

I understand that these introductions are often annoying to write (and they are not appropriate for a presentation held among peers who all mostly know the answers to the above), but maybe someone could provide a quick summary so that laypeople like me can approach this publication? Or maybe there actually is an associated paper?

Yes, as you noticed, this is a "presentation held among peers that all mostly know the answers to the above". From the URL alone you can see that it was a presentation from the IETF's 109th meeting (https://www.ietf.org/how/meetings/109/). The link came from the proceedings (https://datatracker.ietf.org/meeting/109/proceedings), and if you search for this presentation there, you'll see that there's also a video of it.

If you want some context answering your first three questions, the subject is TCP congestion control, and there's a Wikipedia article about it (https://en.wikipedia.org/wiki/TCP_congestion_control). I don't know the answers to your last three questions, since I haven't been following the field; this presentation was mostly targeted at those who are following the field.

TCP congestion control is the set of algorithms that determine when to delay TCP packets and by how much, in the interest of getting maximum speed without overloading routers and links. This is crucial to the survival of the Internet and to the end-user experience, because without congestion control, everybody would send at their maximum rate all the time (subject only to the receiver's stated TCP receive window, which can be huge these days), and the Internet would collapse. Imagine sitting on a 2G connection in rural Africa while the server at the other end blasts packets at you at 100 Gbit/sec. Conversely, imagine sitting on a 100 Gbit/sec connection with a sender that is overconservative and sends only at modem speeds...

Basically, this is a really hard problem, since you can measure bandwidth only indirectly (and with considerable delay), and the changes you make will affect the measurements. Solutions exist in the form of reacting to packet loss (ACKs don't come back, or selective ACKs come back to explicitly tell you something was missing), to ECN (routers along the path setting bits in your header to explicitly tell you bandwidth is getting tight), and to changes in delay (as measured by ACK timing and embedded TCP timestamps; higher delay typically means you're building up a queue somewhere, which is bad). They also differ in what rate they start at (the initial window), how aggressively they increase the rate when things appear to go well (exponential, linear, binary search), and how they decrease the rate when they get negative signals (cut in half, other tactics). An important factor is “fairness”, usually relative to TCP Reno (a classic congestion control algorithm from the 80s); if you and another sender transmit at the same time down the same path and they use some other congestion control algorithm, you should get about half of the bandwidth each. (An algorithm that's too aggressive could easily cause the other side to lose packets and keep decreasing its bandwidth estimate, crowding it off the link entirely.)
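
To make the "increase/decrease" tactics above concrete, here is a toy additive-increase/multiplicative-decrease (AIMD) loop in Python. All numbers are made up for illustration; real TCP tracks a congestion window per ACK and in segments, not per simulation tick:

```python
def aimd_step(cwnd: float, packet_lost: bool) -> float:
    """One round-trip of AIMD: grow linearly, halve on loss."""
    if packet_lost:
        return max(1.0, cwnd / 2)  # multiplicative decrease
    return cwnd + 1.0              # additive increase (one segment per RTT)

# Toy link that "drops" whenever the window exceeds its capacity.
capacity = 32.0
cwnd = 1.0
history = []
for _ in range(100):
    cwnd = aimd_step(cwnd, packet_lost=(cwnd > capacity))
    history.append(cwnd)

# Once converged, the window sawtooths between roughly capacity/2 and capacity.
print(min(history[50:]), max(history[50:]))
```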

A perfect congestion algorithm would immediately detect the correct amount of bandwidth (even in the presence of changing conditions and non-congestion-related packet loss such as a bad cable or Wi-Fi errors), send evenly at that rate, never send so much that packets are lost to congestion, never send so much that routers' queues build up and latency increases (bufferbloat) and be perfectly fair to all other algorithms.

As for census; it's interesting to know what exists out there, what people are doing in practice, and what algorithms we need to compete against and be fair to.

A perfect sending congestion control algorithm would need to trust the receiver, which we don't usually get, on the internets.

It is possible, in TCP, to lie and make the sender flood the route by acknowledging packets that were never actually received. Protocols that allow requesting retransmission of ranges make this loophole exploitable, but it does not seem to be abused often.

When you can trust the receiver implicitly, you can accept reports of end-to-end forward packet transmission time from it, and feed those to a predictor. When the time is increasing, the sending rate is too high; when decreasing, the rate could be higher. This allows ignoring packet loss as a signal, which is good because it is very noisy; it is caused by many things other than congestion.
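
As a toy illustration of that delay-trend idea (purely mine, not any real product's algorithm): raise the rate while the reported one-way delay is flat or falling, cut it while rising.

```python
def adjust_rate(rate: float, delays: list) -> float:
    """Raise the rate while one-way delay is flat or falling; cut it while rising."""
    if len(delays) < 2:
        return rate
    if delays[-1] > delays[-2]:  # delay rising: a queue is building up somewhere
        return rate * 0.95
    return rate * 1.05           # delay flat or falling: probe upward

# Toy channel: a backlog builds whenever we send faster than capacity,
# and the measured delay is proportional to that backlog.
capacity, rate, queue = 100.0, 10.0, 0.0
delays = []
for _ in range(200):
    queue = max(0.0, queue + rate - capacity)
    delays.append(queue / capacity)
    rate = adjust_rate(rate, delays)

# The rate climbs toward the capacity and then hovers around it.
print(round(rate, 1))
```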

Given a predictor, control theory provides easy means to tune sending rate exactly to the instantaneous capacity of the channel, and vary as other traffic begins and ends.

(Interestingly, round-trip time does not work for this. The channel carrying ACKs has completely different characteristics, so RTT is nearly useless. That does not stop people from trying!)

TCP, at its best, fails to use most of the channel when RTT gets long. It is possible to adjust the sending rate in real time to consume exactly the usable bandwidth that TCP does not, so that TCP sees no effect from the better-controlled traffic and is only slowed by sharing with other TCP streams.

The above is the basis for the Aspera file-transfer system, for which IBM still gets huge license fees from military, medical, extraction, and media customers.

> Basically, this is a really hard problem, since you can measure bandwidth only indirectly

What do you mean? My ISP knows which package I bought and how congested the tubes are, and how full the datatanks are.

The hop between your computer and your ISP is likely about 10-20% of the puzzle. Congestion control is necessary to find the optimal transmission rate across all the networks between your PC and the final destination (and back!).

Do you have any source for this? My preconception would be that either me-to-ISP or server-to-their-ISP would be the bottleneck a large portion of the time.

Well let's take a look at the traceroute from my residential PC to Hacker News as an example:

  Tracing route to news.ycombinator.com []
  over a maximum of 30 hops:
    1     1 ms     1 ms    <1 ms
    2    12 ms    11 ms     6 ms  x.x.x.x
    3    29 ms    32 ms    24 ms  y.y.y.y
    4    19 ms    21 ms    22 ms  be23.clmkohpe02r.midwest.rr.com []
    5    28 ms    22 ms    22 ms  be27.clmkohpe01r.midwest.rr.com []
    6    19 ms    23 ms    19 ms  bu-ether31-vinnva0510w-bcr00.tbone.rr.com []
    7    33 ms    31 ms    20 ms  ae-13.edge2.Chicago10.Level3.net []
    8    68 ms    69 ms    69 ms  ae-0-11.bar1.SanDiego1.Level3.net []
    9    71 ms    75 ms    69 ms  M5-HOSTING.bar1.SanDiego1.Level3.net []
   10    69 ms    77 ms    66 ms  news.ycombinator.com []
Hop 1 is to my router

Hop 2 is my router to my ISP

Hops 3, 4, 5, and 6 are all routers within my ISP's network.

Hop 7 is from my ISP into "the internet". Level3 is a major provider of interconnectivity.

Hops 8 and 9 are within Level3's network.

Hop 10 is finally to Hacker News, which indicates HN's "ISP" is Level3 directly.

Only hop 2 is where my "400 Mbps" package limitation is in place. Only hop 10 is "server-to-their-ISP".

I chose 10-20% as my intuition because, as you see, 10 routers were involved to get my packet from my PC to HN, and only 1 of them is constrained by my internet package.

Interesting that you can traceroute all the way to the destination. Over the years I have become accustomed to that not being possible anymore in most cases. Some router is not sending, or is dropping, the ICMP packets, I guess. Not sure for what reason. To save on load? To stop people (competitors) spying on them?

news.ycombinator.com is no exception; it doesn't work:

   5  a9k2-49-infra-dc2.dc3.poneytelecom.eu (  0.745 ms (  0.865 ms a9k2-49-infra-dc2.dc3.poneytelecom.eu (  0.720 ms
   6  lag-110.ear3.Paris1.Level3.net ( 0.940 ms ae53.edge7.Paris1.Level3.net (  9.700 ms lag-110.ear3.Paris1.Level3.net (  0.955 ms
   7  ae-0-11.bar1.SanDiego1.Level3.net (  139.415 ms  139.085 ms  139.031 ms
   8  GIGLINX-INC.bar1.SanDiego1.Level3.net (  150.362 ms M5-HOSTING.bar1.SanDiego1.Level3.net (  140.286 ms  139.842 ms
   9  * * *
  10  * * *
  11  * * *
  12  * * *
  13  * * *
(Sorry for the somewhat poor formatting, caused by typing on a phone.)

Forwarding packets happens in ASICs, generating ICMPs is usually done by the CPU (which has a packet rate of maybe 1/100000th of the ASICs). So the latter is usually either rate-limited or turned off entirely, or you could bring the router to its knees by just spamming packets with too short TTL. It's also not uncommon that people turn it off to reduce external insight into their network, for whatever reason.

Right, but the weird thing was that the grandparent got a full trace to the target and, as usual, I didn't.

By experimenting a bit, I found that when sending UDP packets the traceroute is incomplete, but when sending ICMP packets my traceroute is also complete. That explains how the grandparent could present a full traceroute.

I had just taken it as a fact that full traceroutes are rarely possible these days. For reasons like you say, it's overhead for them and they don't want you to do it.

Same using tracepath from another country

   6:  ae65.edge1.Stockholm1.Level3.net                     35.147ms asymm  7 
   7:  ae-0-11.bar1.SanDiego1.Level3.net                   457.605ms asymm 16 
   8:  M5-HOSTING.bar1.SanDiego1.Level3.net                199.025ms asymm 17 
   9:  no reply
  10:  no reply
  11:  no reply
  12:  no reply

found the answer

  -M icmp

You measured the delay to get an ICMP packet back. Is that strongly correlated with packets being dropped (causing resends)? As you can see even in your example, sometimes the time even goes down when you move "further".

I don't work in the field, but I believe it's an incredibly hard statistical model, not that far from random chaos for a remote observer in the general case.

Ok, but aren't those Level 3 pipes really really fat? And within ISP too? And besides, if they are overloaded, then packages will get rerouted iiuc?

Packets don't get routed based on congestion, no. And while fat, L3's pipes are not infinitely fat (imagine how much overcapacity they would need to pay for to never be congested at any time of the week/day!).

Even if the ISP were the only place congestion could arise, how would it tell the sender?

Not everything is intended for consumption by the general populace, especially when it comes to academic presentations. I'm not trying to be rude, but either do the bare minimum of background reading if you're interested, or move on.

I don’t see the point in every presenter reinventing (probably poorly) the same “bringing you up to speed on core technology xxxx” in every presentation.

I really tried to make clear that I had no expectations for an introduction to be present, as it would be unreasonable to do so:

> they are not appropriate for a presentation held among peers that all mostly know the answers to the above

My comment was merely requesting that someone else provides such a summary, so that the linked material becomes accessible to a wider audience. I expect that the majority of readers here are not particularly familiar with this problem/field, so I thought it would be beneficial to have such a complementary explanation in the thread. (Yes, you can tell people to spend 10 minutes reading through Wikipedia and other supplementary material, but few will do so.) And indeed, I could get significantly more value out of this submission after reading some of the explanations I solicited. Others hopefully/presumably had the same experience (instead of just having to move on).

I didn't want to comment on the downvotes I received, but I would appreciate suggestions for how my wording could have been improved to avoid the reaction you had. That reaction was precisely what I intended to curb with the quoted remark, but it seems it was not very effective.

OK, since you asked:

Your first paragraph ("sorely missing" "required to have") reads like a criticism. The first sentence of the last paragraph ("annoying to write") sounds like "I know you don't always enjoy eating vegetables, but...".

If you meant to just ask for a summary as a non-specialist, next time just say "Could someone explain the context of this to a layperson like me? Specifically, ..."

the wiki article is fine. however it too misses some context.

the internet has no flow control - any sender can issue as many packets as their egress link will bear. Since the subsequent links are not necessarily over-provisioned, if the sum of the input rates exceeds the output rate, some packets may need to be dropped.

for good or ill - the solution is to put a controller on the sender that examines the traffic statistics and tries to adapt its sending rate so as not to overflow the network capacity.

since everyone is doing this, we also kind of want to make sure that everyone sharing an output link adapts to something close to that link's bandwidth divided by the total number of TCP flows. this is fairness.

without getting into the highly involved details you can already see that this is a really interesting control/game theory problem.

I find it particularly weird, astounding even, that this situation is really quite ideal for a tragedy of the commons. except there is no tragedy. everyone can cheat, but almost no one does. sure - you'd have to recompile your kernel, but anyone could ship 'network turbocharger 2.0' that forces all the other TCPs on the network to back off and allocate as much intermediate bandwidth as you like.

a very easy way to do this, as Netscape discovered, is to use multiple connections. the pieces are all the same size, but you can take as many as you like.

obviously that situation could degenerate very quickly. but it hasn't. and it leaves the designers of wide area services that don't just use one TCP connection quite at a loss about how much bandwidth they should take.

as far as the survey goes, I can think of all sorts of reasons why the IETF would care. one reason is that uptake of new work is pretty hard - and has killed lots of great protocol work. I'm pretty surprised that BBR has such rapid uptake - the technical situation remains a little unclear.

for me the interesting part is the discussion of the measurement technique

> everyone can cheat, but almost no one does

Try running a site with big files for downloads. Every week or so, you'll be visited by someone who runs 200+ simultaneous connections for the same file, keeps opening more, and ignores HTTP errors (like “too busy”).

There are also the “IW wars”, where CDNs compete by increasing their initial window more and more to gain an edge. This will be exponentially worse with QUIC, when tuning these knobs becomes more accessible to non-experts.

> everyone can cheat, but almost no one does

That's the point of "fair queuing". The idea is that if you send too much, you just compete with your own packets, and thus are not rewarded for sending too much. This discourages hogging the bandwidth. But fair queuing is computationally expensive, and is rarely seen in core routers, where some FPGA is doing routing. It's also not seen enough in home routers, where fast LAN hits slower uplink and a FIFO queue gets long, increasing latency. That's "bufferbloat". Quite fixable. Now get AT&T and Comcast to cooperate.
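
A minimal sketch of the per-flow idea: give every flow its own queue and serve them round-robin, so a flood only competes with its own backlog. (Real routers use cheaper approximations such as deficit round robin; this is purely illustrative.)

```python
from collections import OrderedDict, deque
from typing import Optional

queues = OrderedDict()  # one FIFO per flow, served round-robin

def enqueue(flow: str, packet: str) -> None:
    queues.setdefault(flow, deque()).append(packet)

def dequeue() -> Optional[str]:
    """Serve one packet from the first non-empty flow, then rotate that flow to the back."""
    for flow in list(queues):
        q = queues[flow]
        if q:
            queues.move_to_end(flow)  # the flow waits behind everyone else next time
            return q.popleft()
    return None

# A greedy flow queues 10 packets, a polite one queues 2:
for i in range(10):
    enqueue("greedy", "g%d" % i)
for i in range(2):
    enqueue("polite", "p%d" % i)

# The polite flow still gets every other transmission slot.
order = [dequeue() for _ in range(4)]
print(order)  # ['g0', 'p0', 'g1', 'p1']
```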

From the paper, "Classic congestion control questions on fairness guarantees and convergence need to be re-answered for the new rate-based congestion control paradigm." Yes. They're looking at performance for individual nodes.

> This will be exponentially worse with QUIC

Or when Google uses it to improve their responsiveness vs. others?

The point is that QUIC (and related efforts) is a library, not implemented in the kernel. This makes it way easier to deploy, but also way easier (in practice) to change in bad ways.

CDNs also adapt the initcwnd based on the destination network the packets are being sent to.

> I'm pretty surprised that BBR has such rapid uptake - the technical situation remains a little unclear.

As I mentioned in a sibling comment, BBR has high uptake among Google properties, but not necessarily much outside; many of the other big players are still using CUBIC, since it's tried and tested. Raw counts aren't a super useful metric, especially if many sibling domains (e.g. google.com, google.com.hk) can count toward that same number. That said, BBR is available as an option in the mainline Linux kernel, and I wouldn't be surprised to see at least a bit of cargo culting explaining its popularity ("it works for Google, so...").

(Alexa top 250 sites are no longer free to view, apparently, but the top 50 contain at least four google properties. Other website rankings: https://moz.com/top500 shows no fewer than 45 hits for "google", not to mention other properties such as Youtube or Blogger; https://trends.netcraft.com/topsites has 33 hits for "Google LLC" out of their top 100.)

I'm surprised by the widespread use of algorithms other than CUBIC, personally. Some of these are obviously in-house, e.g. Akamai CC, and there are some cases of, say, CTCP from Windows Server 2008 machines, but CUBIC is the default on Linux, newer versions of Windows, and even macOS, which I assume covers a large majority of servers.

> As I mentioned in a sibling comment, BBR has high uptake among Google properties, but not necessarily much outside

As you said, it's available in Linux behind a simple sysctl, and here's a Cloudflare blog post describing just that[0]. Netflix contributed the equivalent to FreeBSD[1]. And here's a blog post about AWS CloudFront seeing improvements from it[2]. TFA also mentions Steam.

So it's more than just google.

[0] https://blog.cloudflare.com/http-2-prioritization-with-nginx... [1] https://reviews.freebsd.org/D21582 [2] https://aws.amazon.com/blogs/networking-and-content-delivery...

To understand congestion control, we need to first understand congestion and congestive collapse. In a naive world, we might have every computer try to send all its packets at once (up to some buffer if you're working with a reliable protocol), but if you send more than fits on the network path, routers along the path _may_ buffer a few packets, but at some point they can't buffer any more, so they start dropping packets. If you have a reliable transport protocol such as TCP (which retransmits packets until the other side acknowledges them), then you'll retransmit all the packets that weren't acknowledged, thereby saturating the link again and again, ad nauseam.

Network congestion is a little bit like traffic congestion. There are a bunch of dissimilarities, but the important thing is that if there's a lot of congestion, then everything slows to a halt.

(My terminology will be a bit loose, since I assume laypeople aren't as interested in the nitty gritty details of the bandwidth-delay product, or the specifics of how congestion windows work.)

So, the primary job of congestion control is to determine how much capacity there is on the network, and to avoid going over that.

There are additional desirable properties, such as fairness (sharing the network evenly), or efficiency (sending as close to the capacity as possible), which other solutions might try to improve upon.

(For all of these, the core of the congestion control problem is in the steady-state behavior.)

Some well known options in the space:

- TCP (New) Reno: a series of improvements that use at their core the principle of "additive increase, multiplicative decrease" (AIMD). Basically, you're always probing the network by increasing slowly, but when you see a packet drop, you assume it's due to congestion and immediately halve the amount of bandwidth you're using. This performs well in terms of fairness, but has poor efficiency characteristics (see the "TCP sawtooth", which in the limit might achieve an efficiency of only ~75%).
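
The ~75% comes from the sawtooth geometry: after a halving, the window ramps linearly from W/2 back up to W, so its average over a cycle is 3W/4. A quick sanity check (illustrative arithmetic, not a protocol simulation):

```python
# Average of a linear ramp from W/2 up to W, relative to the peak W.
W = 1.0
samples = [W / 2 + (W / 2) * t / 1000 for t in range(1001)]
avg_utilization = sum(samples) / len(samples) / W
print(round(avg_utilization, 3))  # 0.75: the sawtooth wastes about a quarter of the peak
```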

- TCP CUBIC: the premise behind CUBIC is to improve upon New Reno's efficiency. Instead of climbing back up linearly, what if we probed along a cubic curve? Then we climb quickly at first, slow down as we approach our previous sending rate (pre-drop), and speed up again once we pass that limit, in case there was a period of congestion that has since cleared up.
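
For reference, the actual curve (from RFC 8312, which specifies CUBIC) is W(t) = C*(t - K)^3 + W_max, where K is chosen so the window returns to W_max, its size at the last loss. A small sketch using the RFC's constants:

```python
C = 0.4     # scaling constant from RFC 8312
beta = 0.7  # window is cut to beta * W_max on a loss

def cubic_window(t: float, w_max: float) -> float:
    """Window size t seconds after a loss, per the CUBIC growth function."""
    k = ((w_max * (1 - beta)) / C) ** (1 / 3)  # time at which we regain w_max
    return C * (t - k) ** 3 + w_max

w_max = 100.0
k = ((w_max * (1 - beta)) / C) ** (1 / 3)
print(round(cubic_window(0, w_max)))  # 70: restart at beta * w_max
print(round(cubic_window(k, w_max)))  # 100: flatten out at the old maximum
```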

- TCP BBR: stands for "bufferbloat reduction". Bufferbloat is the increase in latency that may precede drops due to congestion, whereas, say, issues with a flaky wifi connection might not manifest in the same way. So, the goal is to detect congestion earlier than relying on packet drops (which are jarring but not always due to congestion). Previous attempts to take advantage of latency as an early detection signal such as Vegas had issues with adoption around fairness, in that they would back off more quickly than, say, New Reno or CUBIC, and thus would achieve lower throughput; BBR is more aggressive in some situations but less aggressive in others.

Why do we want a census? So we can see what's out there, what adoption rates look like, and how widely newer protocols are deployed in the greater ecosystem. For instance, BBR is called out as serving >40% of downstream internet traffic, but over 39% of downstream traffic is from Google properties using BBR (unsurprising, given that BBR came out of Google).

How have prior reports been acquired? 2011 report: https://cse.unl.edu/~xu/research/TCPcensus.html

There's a lot of existing work around inferring TCP congestion control (especially how the congestion window evolves in reaction to loss), and I'm sure this survey made use of a fair amount of the existing literature.

(I don't know how IETF decisions have been impacted in the past, so I can't answer that one. But I imagine that this helps inspire further research in the field.)

Thank you very much for this! With this background I can mostly make sense of the slides and appreciate the work behind them.

Edit: Also thanks to the other replies that have pointed out the interesting challenges involved. Don't want to comment on each one individually.

You asked good questions; and got great answers. great example of what this site is for if you ask me.

Small nit but BBR stands for bottleneck bandwidth and round-trip propagation time [1].

[1] https://queue.acm.org/detail.cfm?id=3022184

BBR, IIUC, fatally relies on round-trip time to probe the channel, which hardly ever works. Google probably can arrange to make it work between their data centers, but that doesn't help anybody else.

I explain elsewhere in this thread the optimal algorithm, that is sadly not compatible with TCP's conditions.

The paper supporting this is in another comment: https://news.ycombinator.com/item?id=25239684

Needs [PDF] but also it is really good

Netflix is either MulTCP (Renoish) or BBR on FreeBSD so they got that one completely wrong. Amazon Prime also has a 1/4 chance of coming from FreeBSD BBR via Limelight CDN. Anything on Limelight has a chance of being FreeBSD BBR depending on the customer otherwise FreeBSD CUBIC (with a ton of bugs). I wonder if AkamaiCC is still FastTCP.

Isn't this still calling into the pluggable CC framework, with all the CC_ calls? You could say that the various CCs are also using RACK on linux too.

These are beautiful slides. Is anyone able to tell me which program and/or packages were likely used to create them?

They're beautiful but I have a feeling that brown on white will wash out badly on many forms of projector or bright conference room environment.

Looks like the work of a professional graphic designer using InDesign and the rest of the Adobe Creative Suite. This format, rendered as PDF, is the current standard in the marketing/entertainment world for pitches, as it gives the most control and best results for a graphics-driven industry. It also serves as a litmus test to keep out riff-raff who can't deliver on the level the industry is accustomed to. Seems to me the census hired or has access to a pro designer with agency experience and got this convention as a bonus.

Actually, the author seems to be a grad student in the field.

From downloading the PDF and skimming the raw bytes, I see:

    <pdf:Producer>Microsoft® PowerPoint® 2016</pdf:Producer></rdf:Description>
That said, it's always impressive to see a non-professional spend this much attention to detail.


Yup, a quick google shows that it is actually a variant of the built-in "Integral" theme in Microsoft PowerPoint.

Thanks! I was so sure this just _couldn't_ be PowerPoint that I didn't even bother to check.

I suppose the lesson here is that talent matters more than the tool.

Strange, I've found Powerpoint default themes to be very pretty and professional.

Yes it is, and I'm always glad to be proven wrong!

Sincere thanks for the research!

the author's homepage is pretty swish https://www.comp.nus.edu.sg/~ayush/

Can someone elaborate what BBR G1.1 entails? Quick googling just points to this presentation.

it's explained in section 4.2 of the paper https://www.comp.nus.edu.sg/~bleong/publications/sigmetrics2...

afaik 1st-gen bbr is somewhat unfair to other cc-algos to the point that it will starve them. use it if you want to be 'that guy' (or google :)

update: slide 27 and onwards https://www.potaroo.net/presentations/2018-05-10-bbr.pdf

So basically a meeting to discuss the "Two Generals Problem", which has been proven to be unsolvable. Unless there is some serious detail, nothing much to see here.

There are many problems which are provably unsolvable in the general case but it is nevertheless of interest to see how well we can do in spite of that fact, because they are useful problems to solve even approximately.

A survey of internet properties and how they each solve this problem for their own domains seems reasonable to consider as "serious detail."
