
Bottleneck Bandwidth and RTT
https://patchwork.ozlabs.org/patch/671069/
======
Animats
This is useful mostly for long-lived connections. As Google moves away from
the many-short-connections HTTP model to the persistent connections of HTTP/2,
connections live long enough that both bandwidth and delay can be discovered.

This is much better than trying to estimate bandwidth from packet loss.

~~~
AdamJacobMuller
Any idea of how to define "long lived"?

Excluding the HTTP/2 situations, obviously if you're fetching a single small
resource (an image or something) that takes <1s then that's short, but where's
the line there? Is something >1m long?

~~~
Animats
10 seconds in this code.

    static u32 bbr_min_rtt_win_sec = 10; /* min RTT filter window (in sec) */

The code also filters over the last 10 round trips, so that's what it takes to
get a stable estimate. The paper [1] makes the point that this method works
well for maintained TCP connections with idle periods. In practice, that means
HTTP/2.

It would be interesting to see test data on this for large numbers of real
connections. How much do bandwidth and delay vary in practice across ISP
links, cable headends, and cellular links?

[1] http://caia.swin.edu.au/cv/dahayes/content/networking2011-cdg-preprint.pdf

~~~
apenwarr
The averaging period says more about how fast they expect min_rtt to change
than how long it takes to calculate it. If the rtt is fairly stable, that
measurement could be accurate much sooner than 10s.

Super short lived sessions can really only go faster with tricks like
increasing the initcwnd. Anything longer than that, I'd expect bbr to work
well.

------
kev009
This is the same Van Jacobson who was instrumental in working through TCP
congestion collapse growing pains over 30 years ago
https://en.wikipedia.org/wiki/Network_congestion#Congestive_collapse

------
jnordwick
Intra-network, TCP slow start is often turned off to minimize latency,
especially on links where you have to respond very quickly to initial data or
where traffic is bursty.

Google BBR seems to use the same exponential probing that slow start does, so
I wonder how it will perform when you are staying in-network, don't often have
to worry about packet loss or congestion, and want the link to start off at
full throttle.

Once BBR enters its steady state it intentionally cycles faster and slower,
but this seems like it is creating additional latency when you don't want it.
Think of a traffic burst that happens just as the link decides to cycle
slower.

It also seems like the protocol intentionally runs slower than it could so as
not to create buffer pressure on the receiving side, if I'm understanding this
quick description properly: "then cruising at the estimated bandwidth to
utilize the pipe without creating excess queue".
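
As I read the description, the cycling is a pacing gain applied on top of the
estimated bandwidth: briefly probe above it, dip below it to drain whatever
queue the probe built, then cruise at 1.0. A toy sketch of that idea (the 5/4
and 3/4 gains are taken from the patch description; everything else is made up
for illustration):

    #include <stdio.h>

    /* Toy model of PROBE_BW-style gain cycling: pace at gain * estimated_bw,
     * where the gain cycles through one probing phase (5/4), one draining
     * phase (3/4), and several cruise phases (1.0). Illustrative only. */
    static const double probe_bw_gain[8] = { 1.25, 0.75, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 };

    static double pacing_rate_bps(double est_bw_bps, int round)
    {
        return probe_bw_gain[round % 8] * est_bw_bps;
    }

    int main(void)
    {
        double est_bw = 100e6; /* pretend the measured bottleneck is 100 Mbit/s */
        for (int round = 0; round < 8; round++)
            printf("round %d: pace at %.0f Mbit/s\n",
                   round, pacing_rate_bps(est_bw, round) / 1e6);
        return 0;
    }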

This line just scares me: "Occasionally, on an as-needed basis, it sends
significantly slower to probe for RTT (PROBE_RTT mode)."

Google is going to make patches that work for them, but that doesn't always
mean they will work for everybody else. This seems very closely tailored to
Google's traffic issues and serving HTTP over persistent connections, and not
a general purpose feature; think of games, intra-network low-latency
applications, etc.

~~~
YZF
Right. I worked on congestion avoidance for a UDP based protocol and it's
really difficult to get right, because you have to coexist with lots of other
algorithms on the same network. If you are "fighting" a TCP stream using a
loss based algorithm, it's virtually impossible to get your fair share of
bandwidth without incurring packet loss.

An algorithm for efficient local networking, where everything is under your
control, is very different from something that runs over the Internet. I'm not
even sure that this style of congestion avoidance is the best approach for a
tightly controlled local network.

Edit: and keep in mind that packet loss is correlated with queues filling up.
So as long as there are lots of loss-based algorithms in the wild, it's
difficult for someone to come in with a better solution that coexists with
those other flavours (at least if you're not allowed to touch the "network"
itself).

~~~
jamesblonde
We are also working on congestion avoidance for a UDP based protocol - our
goal is to share large volumes of data over the Internet, but with high
throughput over high latency links and support for NAT traversal. So no UDT.
We're using a variant of LEDBAT (uTP/micro-torrent). What are your negative
experiences based on?

~~~
YZF
It's not really a negative experience; it's just that it can be difficult. In
general I think it's useful to think of the bandwidth of a TCP stream as a
function of loss and latency (and connection time, at least initially; less of
a factor for long lived active connections). The different variants of TCP
have slightly different curves. You want your UDP algorithm to play nice with
these, otherwise you may use more than your fair share or you'll get hammered
by competing TCP streams. Apologies if this is kind of basic :)
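
The usual back-of-the-envelope version of that curve is the Mathis et al.
approximation for loss-based TCP: throughput is roughly C * MSS / (RTT *
sqrt(p)). A toy calculation, purely illustrative and nothing to do with the
BBR patch itself:

    #include <math.h>
    #include <stdio.h>

    /* Mathis et al. approximation for steady-state loss-based TCP throughput:
     * roughly (MSS / RTT) * C / sqrt(p), with C about 1.22 for Reno-style
     * recovery. A competing UDP protocol that wants its fair share has to
     * respond to loss and latency on a comparable curve. */
    static double tcp_throughput_bps(double mss_bytes, double rtt_sec, double loss_prob)
    {
        return (mss_bytes * 8.0 / rtt_sec) * 1.22 / sqrt(loss_prob);
    }

    int main(void)
    {
        /* 1460-byte segments, 50 ms RTT, 0.1% loss: roughly 9 Mbit/s */
        printf("~%.1f Mbit/s\n", tcp_throughput_bps(1460.0, 0.050, 0.001) / 1e6);
        return 0;
    }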

------
otoburb
Many TCP optimization algorithms report their performance improvements using
CUBIC as their baseline. Will be very interesting to see how TCP optimization
vendors adapt to the new Bottleneck Bandwidth & RTT patch.

From an industry viewpoint, I wonder how this will perform over traditionally
higher-latency and higher-loss wireless networks.

As an aside, I love how small the patch is, weighing in at 875 LOC including
comments.

------
stephen_g
This is very exciting, and I can't wait to see some performance data from it.
Bufferbloat is a huge problem so it's awesome to see work being done in this
area. It's really cool also that it can improve things just by patching
servers!

How does this interact with routers that use algorithms like fq_codel to
reduce bufferbloat? Is it better to just have one or the other, or do they
work well together?

~~~
bcook
CoDel is meant to solve bufferbloat at a single network node, not along the
entire path.

TCP BBR, as a consequence of more accurate congestion avoidance, seems to
(hopefully) reduce bufferbloat _along the entire path_.

Both of these algorithms should work together well since they do not really
compete. (assuming that I understand...)

------
mhandley
The linked article doesn't provide enough information to understand how
they're using acks to probe bottleneck bandwidth, but they have prior work on
this. If it's similar to Van Jacobson's pathchar, I would have thought there
might be "interesting" interactions with NICs that do TCP TSO and GRO (using
the NIC to do TCP segmentation at the sender and packet aggregation at the
receiver), as these will mess with timing and, in particular, with how many
acks get generated. Still, the authors are very well known and respected in
the networking community, so I expect they have some way to handle this.

~~~
apenwarr
It doesn't need to be as complicated as pathchar, because of a lucky fact: if
you measure the end to end bandwidth, you are always measuring the bottleneck
bandwidth :) So you can just count how many packets arrive at the other end
between two time points and call that the bottleneck bandwidth.
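
In other words, a delivery-rate sample is just the bytes newly acked over an
interval, divided by the length of the interval; whatever the slowest link on
the path is, the sustained delivery rate can't exceed it. A bare-bones
user-space sketch of that idea (names invented, not the kernel's code):

    #include <stdint.h>
    #include <stdio.h>

    /* Delivery-rate sample: bytes delivered (acked) between two points in
     * time, divided by the elapsed time. Over any sustained interval this is
     * bounded by, and so directly samples, the bottleneck bandwidth. */
    struct rate_interval {
        uint64_t prior_delivered; /* total bytes delivered at interval start */
        uint64_t prior_time_us;   /* timestamp at interval start (microseconds) */
    };

    static double delivery_rate_bps(const struct rate_interval *ri,
                                    uint64_t delivered_now, uint64_t now_us)
    {
        uint64_t bytes = delivered_now - ri->prior_delivered;
        uint64_t us = now_us - ri->prior_time_us;
        return us ? (double)bytes * 8.0 * 1e6 / (double)us : 0.0;
    }

    int main(void)
    {
        struct rate_interval ri = { .prior_delivered = 0, .prior_time_us = 0 };
        /* 1.25 MB acked over 100 ms comes out to 100 Mbit/s */
        printf("%.0f Mbit/s\n", delivery_rate_bps(&ri, 1250000, 100000) / 1e6);
        return 0;
    }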

------
dedalus
The key point here is the deployment model: only the sender changes, with no
network or receiver side changes. This means that if you see even a 5%
improvement, people are going to deploy it. Also, the key architect seems to
be Neal Cardwell, with Van Jacobson as a signoff/vetting authority. Eric
Dumazet has long done a lot of work on TCP pacing (look at his prev patches).

Basically, all this means is that we have a form of TCP pacing that can work
(but suffers from the classic prisoner's dilemma).

~~~
tarrga
I work at Google in Bandwidth-SRE; watched this as it developed. Van's name is
not on this patch as an appeal to authority --- he contributed meaningfully to
the design. The analysis to support unloaded vs. loaded probes (to measure
time-separated variables) was entirely his.

------
_RPM
This is really interesting work. I wish I was smart enough to do this kind of
stuff.

~~~
jeff_marshall
There is nothing magic about networking! If you are interested, start reading
things and playing around. The issue, like most complex subjects, is that most
people aren't interested enough in the details to get to the point where they
are confident enough to do something innovative and present it to the world.

~~~
_RPM
I've done a little bit of research on the team that implemented this and they
are mostly PhDs. I was never that strong of an academic outside of my CS
classes. I just didn't care about anything but CS, and spent more time
programming than I probably should have in my undergraduate years. Years later
I suffer the consequences of that by having a lower GPA than most grad
students would.

~~~
Cthulhu_
GPA is just a number, and a snapshot taken at a certain (early) time in your
life and development at that. The PhDs most likely have at least 10 more
years of experience than you, and their achievement - the research you read -
is again only a snapshot of their life and level at that point in time.

Impostor syndrome (which you seem to show signs of) is mainly caused by only
seeing the steps above you and forgetting the steps you've already taken.

------
apenwarr
My favourite feature of this algorithm is that it ought to fix the problems of
very fast TCP over very long distances when there is nonzero packet loss. I
think :) Traditionally, recovering from packet loss can take several RTTs, and
if round trips are long, the connection might never get up to full speed. In
theory, I think this will fix that.

------
wscott
Very interesting, but I suspect that, like many of these, it works best when
competing only with connections using the same approach. Which is probably why
Google is talking about using this inside their backbone. This estimates the
queues in the network and tries to keep them empty. Loss-based TCP ramps up
until packets start dropping, so it spoils things for the BBR connection.
Perhaps combined with CoDel to drop TCP packets early, the two could play
nicely together.

Hmm, reading the code it says it does play well with TCP, but "requires the fq
("Fair Queue") pacing packet scheduler." In fact, later it says it MUST be
used with fq. Hmm.

BTW the code is very readable and well commented.

~~~
apenwarr
I think the need for fq is just because that's how Linux does pacing, and
pacing is essential for the accuracy of measurements used by bbr.

------
nieksand
Has anybody managed to find a preprint of the ACM queue paper floating around?

------
quietplatypus
The name is quite misleading for those not familiar with previous work.

How's this different from TCP Vegas and FAST TCP which also use delay to infer
the bottleneck bandwidth?

~~~
apenwarr
bbr does not infer bandwidth using delay; it just measures the amount of data
acked between two points in time (i.e. it directly measures the bandwidth). It
also directly measures the rtt. The big insight in bbr is that these two
values are all you need (and you convert them into a "pace" and a window
size); unlike e.g. Vegas, you don't slow down just because latency increases,
you slow down when measured bandwidth decreases. This makes it far more
resilient, especially to other competing TCP flows.
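
Roughly, those two measurements combine via the bandwidth-delay product: pace
at (gain * measured_bw) and cap the data in flight at a small multiple of
measured_bw * min_rtt. A sketch of that arithmetic (the gains and names here
are placeholders for illustration, not the patch's exact values):

    #include <stdio.h>

    /* Sketch: turn a measured bottleneck bandwidth and minimum RTT into a
     * pacing rate and a congestion window sized around the bandwidth-delay
     * product (BDP). Gains are placeholders for illustration. */
    struct bbr_like_state {
        double bw_bps;      /* measured bottleneck bandwidth, bits/sec */
        double min_rtt_sec; /* measured minimum round-trip time, seconds */
    };

    static double pace_bps(const struct bbr_like_state *s, double pacing_gain)
    {
        return pacing_gain * s->bw_bps;
    }

    static double cwnd_bytes(const struct bbr_like_state *s, double cwnd_gain)
    {
        double bdp_bytes = s->bw_bps * s->min_rtt_sec / 8.0;
        return cwnd_gain * bdp_bytes;
    }

    int main(void)
    {
        struct bbr_like_state s = { .bw_bps = 50e6, .min_rtt_sec = 0.040 };
        /* 50 Mbit/s x 40 ms is a 250 KB BDP; pace at 1.0x, cap inflight at 2x BDP */
        printf("pace: %.0f Mbit/s, cwnd: %.0f KB\n",
               pace_bps(&s, 1.0) / 1e6, cwnd_bytes(&s, 2.0) / 1e3);
        return 0;
    }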

------
falcolas
Is anyone able to speak to the compute and memory overhead this requires, in
comparison with the loss-based algorithms? I ask on behalf of firewalls and
layer 4 routers everywhere.

Can this really just be patched in, with no changes to specialized hardware?

~~~
Filligree
It's an endpoint-side algorithm. This goes on whatever terminates the TCP
connection: servers, PCs, etc.

Firewalls shouldn't be affected, and switches won't be. On behalf of everyone
forced to use L4 switches, though, can you please reduce the buffer sizes? :P

