TCP BBR congestion control comes to GCP

atomt · on July 20, 2017

A warning if you want to try out BBR yourself:

Due to how BBR relies on pacing in the network stack make sure you do not combine BBR with any other qdisc ("packet scheduler") than FQ. You will get very bad performance, lots of retransmits and in general not very neighbourly behaviour if you use it with any of the other schedulers.

This requirement is going away in Linux 4.13, but until then blindly selecting BBR can be quite damaging.

Easiest way to ensure fq is used: set the net.core.default_qdisc sysctl parameter to "fq" using /etc/sysctl.d/ or /etc/sysctl.conf, then reboot. Check by running "tc qdisc show"

Source: bottom note of https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

jsolson · on July 20, 2017

A related warning: your NIC driver can break the qdisc layer (or at least mislead it).

Prior to 4.12 the VIRTIO Net driver always orphans skbs when they're transmitted (see start_xmit()). This causes the qdisc layer to believe packets are leaving the NIC immediately until you've completely filled your Tx queue (at which point you will now be paced at line rate, but with a queue-depth delay between the kernel's view of when the packet hit the wire and when the packet actually hit the wire).

After looking at the code -- even in 4.12 enabling Tx NAPI still seems to be a module parameter.

(I'm not sure which other drivers might have the same issue -- my day job is limited to a handful of devices, and mostly on the device side rather than the driver side)

atomt · on July 21, 2017

That is good to know. I just deployed BBR on some pilot virtio backed VMs yesterday and I missed this.

As far as I can tell the Actual Hardware I'm running my other BBR pilots on are doing the right thing.

File under: BBR - still a gotcha or two ;-)

atomt · on July 20, 2017

To try it out, make sure that your Linux kernel has:

CONFIG_TCP_CONG_BBR

CONFIG_NET_SCH_FQ (not to be confused with FQ_CODEL)

Put these into /etc/sysctl.conf:

net.core.default_qdisc=fq

net.ipv4.tcp_congestion_control=bbr

Reboot.

sp332 · on July 20, 2017

I haven't tested this, but you should be able to sysctl -p to reload the config instead of rebooting.

atomt · on July 20, 2017

Just loading the sysctl values will not switch the packet scheduler on already existing network interfaces, but it will start using BBR on new sockets.

Switching the scheduler at runtime using tc qdisc replace is possible, but then you need to take extra care if the device is multi queue or not. Instead of explaining it all here just rebooting is probably simpler.

nh2 · on July 20, 2017

BBR is, in my opinion, one of the most significant improvements to networking in recent years.

I've started using it on a couple of long-range routes (e.g. Switzerland to Ireland, Frankfurt to Singapore) with Gigabit servers on the Internet, and it turns unreliable ~200 Mbit/s transfer rates into reliable > 850 Mbit/s.

And all that's needed is `sudo modprobe tcp_bbr && sysctl -w net.ipv4.tcp_congestion_control=bbr`.

Great job really!

aexaey · on July 21, 2017

No need for modprobe on a system with sane udev: Just sysctl should be sufficient.

morecoffee · on July 21, 2017

Dumb question: the remote side need to enable it too, right?

manigandham · on July 21, 2017

Doesnt seem like it: https://news.ycombinator.com/item?id=14814206

signa11 · on July 21, 2017

> Dumb question: the remote side need to enable it too, right?

umm, no.

nealmueller · on July 20, 2017

Today’s Internet is not moving data as well as it should. TCP sends data at lower bandwidth because the 1980s-era algorithm assumes that packet loss means network congestion.

BBR models the network to send as fast as the available bandwidth and is 2700x faster than previous TCPs on a 10Gb, 100ms link with 1% loss. BBR powers google.com, youtube.com, and apps using Google Cloud Platform services.

Unlike prior TCP advancements like TCP QUIC which required a special browser, BBR is a server-side only improvement. Meaning you may already be benefiting from BBR without knowing it. BBR requires end users to make no improvements. This is especially relevant in the developing world which use older mobile platforms and have limited bandwidth.

YZF · on July 20, 2017

There have been a lot of modifications to TCP since the 1980's to allow it to push a lot more bandwidth on faster networks. Most notably perhaps window scaling.

How does BBR avoid killing other streams that happen to share the same pipe? It seems it would consume more than its fair share if the other TCP streams are using older algorithms.

p.s. presumably if you get 1% loss with no congestion there's wireless/mobile involved?

wbl · on July 20, 2017

BBR uses mode switching to learn what the latency is and what its fair share bandwidth is.

YZF · on July 20, 2017

Do you know if any experimental results of sharing with the other congestion avoidance flavors are available somewhere? Historically this requirement for backwards compatibility has been a big problem. Maybe YouTube is getting better but other web traffic is getting hosed?

wmf · on July 20, 2017

BBR yields to CUBIC or Reno somewhat. https://www.ietf.org/proceedings/97/slides/slides-97-iccrg-b... slide 23 https://www.ietf.org/proceedings/98/slides/slides-98-iccrg-a... slides 16-18

YZF · on July 20, 2017

Cool. Thanks! I worked on a UDP congestion avoidance algorithm that had bandwidth/latency feedback built into the protocol and had to deal with some of the same issues.

signa11 · on July 20, 2017

here is the acm-queue article/paper on the same thing:

https://cacm.acm.org/magazines/2017/2/212428-bbr-congestion-...

edit-01:

some more sources of information

ietf drafts on the same topic available here:

https://tools.ietf.org/html/draft-cardwell-iccrg-bbr-congest...

https://tools.ietf.org/html/draft-cheng-iccrg-delivery-rate-...

and a blog post giving a detailed history of various congestion control mechanisms, and bbr as well:

https://blog.apnic.net/2017/05/09/bbr-new-kid-tcp-block/

kyrra · on July 20, 2017

Link is broken. Is it this?

http://queue.acm.org/detail.cfm?id=3022184

or this

https://cacm.acm.org/magazines/2017/2/212428-bbr-congestion-...

signa11 · on July 20, 2017

thanks for the heads up, it is the latter. i have updated the info.

voltagex_ · on July 21, 2017

Looks like it's being tested inside Netflix on FreeBSD: https://wiki.freebsd.org/TransportProtocols/26Jan17

floatboth · on July 21, 2017

Nice. Came into this thread thinking "hmm I wish someone ported this to FreeBSD"… of course it's Netflix :D

eikenberry · on July 20, 2017

For those interested in a simple guide on how to try it on your servers, this is the most to the point I've found.

https://www.admon.org/networking/update-linux-kernel-to-enab...

theandrewbailey · on July 20, 2017

CentOS 7 only.

netheril96 · on July 21, 2017

As a Chinese whose international bandwidth has been abysmal, BBR has been a godsend. The difference in speed when I turn to BBR on my shadowsocks server is astronomical.

_tefa · on July 20, 2017

> When a GCP customer uses Google Cloud Load Balancing or Google Cloud CDN to serve and load balance traffic for their website, the content is sent to users' browsers using BBR. This means faster webpage downloads for users of your site.

This makes it sound like BRR is only available for Google-managed services on GCP, is that correct? Can I use BRR on GCE servers (which can install the kernel module)? Seems like an odd thing to leave out.

nealcardwell · on July 20, 2017

Yes, you can use BBR inside a VM in GCE. Here is a quick-start guide if you are interested in doing that:

https://github.com/google/bbr/blob/master/Documentation/bbr-...

jsolson · on July 20, 2017

Note that, in addition to Neal's instructions, you may want to load virtio_net with napi_tx=true

This makes virtio_net play more nicely with the qdisc layer. GCE requests moderately deep Tx queues 4096 -- without the module param you can have up to a 4096 packet delay between actual and as-seen-by-qdisc Tx times.

http://elixir.free-electrons.com/linux/v4.12.2/source/driver...

orware · on July 20, 2017

Over the weekend I setup one of the new consumer mesh products that's available, the Linksys Velop, with 9 nodes covering a good sized area between two homes.

One thing I've been noticing though is that there is considerable latency/packet loss at the moment (there is only one wired backhaul at the primary node and all of the other nodes are connected to each other wirelessly).

I've been running Ping Plotter against to all of the nodes and there seems to be considerable packet loss (a few percent) and spikes in latency (average for the two closest nodes to my laptop is about 15 ms, the middle ones out a ways are about 30-40 ms, and the furthest ones are at about 60 ms) but the spikes can be in the hundreds or even thousands of ms.

The area covered is about a 500 ft by 120 ft rectangle more or less (with my house on the bottom left of that rectangle and the other home on the bottom right of that rectangle).

My question would be...would this BBR algorithm help in some way to reduce the latency/packet loss in a situation like this? Or does it only apply for these other situations that Google would normally be encountering/dealing with?

Thanks for the input!

sbierwagen · on July 20, 2017

Sounds like just bad physical-layer connectivity, nothing to do with TCP.

Most of my experience with wifi mesh is from years ago, with pure 2.4ghz stuff, back when it basically didn't work at all. How close are the nodes? Are there any long multi-hop chains? (Repeater talking to repeater talking to root. The more hops, the worse it works)

mangix · on July 20, 2017

BBR solves a different problem. The problem you have is wifi being terrible. There's massive buffering in the firmware which you can't get rid of.

adrianmonk · on July 20, 2017

That doesn't seem like all that different of a problem. I'm certainly not an expert on BBR, but from reading the description, the design goals seem to explicitly include dealing with buffers better (by making efforts to not fill them to the max) and being less skittish about packet loss.

Specifically, the description (in the git commit) says it has a "congestion control algorithm that reacts to actual congestion, not packet loss or transient queue delay" and that it estimates the size of the queue it probably created and paces packets in order to "utilize the pipe without creating excess queue". (And that last part addresses large buffers. Even if they can't be turned off, they only fill up if you send enough data to fill them.)

Obviously it isn't magic and there is only so much any algorithm can do in the face of a cruddy physical network layer, but the traditional algorithm makes a bad situation much worse than it has to be, so there is still the potential for a newer algorithm (like BBR) to make a big improvement.

Anyway, more to the point, cruddy wifi is what a lot of people use to browse the web, so it's not surprising to me if Google tried to account for that in their design.

orware · on July 20, 2017

In this case do you mean the firmware being provided by Linksys for the hardware, or an additional layer of firmware embedded into the wireless hardware in some way? It actually looks like for the Velop they are using some form of OpenWRT it looks like from what I can see when I pull up the sysinfo page in the router so it makes it seem like they would have the ability to customize/tweak the buffering settings in some way (I took Georgia Tech's Networking class about two years ago now, but it was pretty neat to learn about the buffer bloat problem there in that course and how they were mentioning bigger buffers weren't necessarily better for performance).

wtallis · on July 21, 2017

The radio's firmware, not the Linux OS running on the application processor. All 802.11ac radios have closed-source firmware even when there are open-source Linux drivers to communicate with the NIC. The 802.11n chipsets by Atheros didn't use proprietary firmware and exposed a fairly low level interface to the host system. This led to the open-source ath9k Linux driver being the platform of choice for people trying to fix WiFi in general or improve the Linux WiFi stack in specific.

Animats · on July 20, 2017

If you have significant page load delays with Wordpress sites, it's probably not a TCP-level problem.

losvedir · on July 21, 2017

Ha, as soon as I saw this I was hoping you were going to chime in!

May I ask if you have any thoughts on BBR? In what ways is networking different from when you published yours that might warrant (or not!) another congestion control algorithm?

est · on July 21, 2017

well, neither CUBIC nor RENO can move your 10MB javascript over a poor connection as efficient as BBR.

MertsA · on July 21, 2017

Bah, what would you know about congestion control in IP/TCP.

kuschku · on July 20, 2017

So, what would be required to use this outside of GCP? All documentation on BBR only discusses GCP.

est · on July 21, 2017

1. install 4.10+ kernel

2. echo "bbr" > /proc/sys/net/ipv4/tcp_congestion_control

3. save it to sysctl.conf

4. restart and you are done.

MertsA · on July 21, 2017

> echo "bbr" > /proc/sys/net/ipv4/tcp_congestion_control

[...]

> restart and you are done.

You only need one or the other. Also, you could just reload your sysctl.conf instead.

_vvdf · on July 20, 2017

I'd much prefer to see GCP add IPv6 support, which is sorely lacking.

wsh91 · on July 20, 2017

Have a look at https://cloud.google.com/compute/docs/load-balancing/ipv6.

(Disclaimer: I work on GCP, albeit not on networking stuff.)

daurnimator · on July 20, 2017

That's not helpful for a wide range of use cases. The most recent one I ran into was running an irc server that would be compatible with the matrix irc bridge, see https://github.com/matrix-org/matrix-appservice-irc/issues/2...

gafferongames · on July 20, 2017

This is great. Now please solve the head of line blocking for time critical data.

lern_too_spel · on July 20, 2017

That's solved in HTTP2 and other connection multiplexing protocols.

manigandham · on July 20, 2017

HTTP/2 runs on top of a single TCP connection so it's still vulnerable to TCP ordering requirements.

TCP will probably never have a mainstream solution to this, better to switch to UDP or QUIC instead.

morecoffee · on July 21, 2017

While I agree QUIC is a better long term solution[1], saying TCP ordering affects HTTP/2 is misleading. It is true, but it is quite easy to avoid bad behavior by using TCP_NOTSENT_LOWAT (also created by Google), to avoid HoL blocking. For example, SPDY had a similar problem which was ameliorated by only sending when the water mark is low enough:

https://insouciant.org/tech/prioritization-only-works-when-t...

[1] https://news.ycombinator.com/item?id=12282898

vitus · on July 20, 2017

Only sort of. You still have some HOL blocking problems with HTTP2 because you're still running on top of TCP. I recall this explicitly being one of the selling points of QUIC.

First resource I found on this matter from a quick search: https://engineering.salesforce.com/the-full-picture-on-http-...

kev009 · on July 22, 2017

anonymousDan · on July 20, 2017

Sounds like it would be great for wireless ad hoc networks.

abpavel · on July 21, 2017

A smack in the face of net neutrality, because the protocol hogs the bandwidth at expense of all other traffic.

It's like putting a tank on a standard regulation road, and boasting about how well it performs in the standard city congestion environment, because you simply can roll over other cars and trucks and run the red lights.

Beauty of TCP Classic (i.e. Reno) is an ability to scale by fairly sharing bandwidth among flows. When one party is aggressively trying to utilize as much of the bandwidth as possible, it is no longer fair, and will simply force netadmins to classify Google's protocols into more aggressive queues in private, and supply fuel against net neutrality in public.

manigandham · on July 21, 2017

That's a stretch - how do you know this? Especially since BRR has already been live on google/youtube for a while?

This thread has some comments about it: https://news.ycombinator.com/item?id=14814616

The kernel commit also has a note: https://github.com/torvalds/linux/commit/0f8782ea14974ce9926...

"It can operate over LAN, WAN, cellular, wifi, or cable modem links. It can coexist with flows that use loss-based congestion control, and can operate with shallow buffers, deep buffers, bufferbloat, policers, or AQM schemes that do not provide a delay signal."

kilgariff · on July 21, 2017

http://queue.acm.org/detail.cfm?id=3022184

Correct me if I'm wrong, but I think it would work like this:

Say client A (loss-based) and client B (BBR) are on the same congested network:

A would fill the bottleneck buffer until the buffer overflows, then back off quickly due to the high number of dropped packets. This creates a sawtooth-like pattern of gradual ramp-up and sharp falloff.

B would detect the bottleneck bandwidth and the RTT, so it knows to back off before the bottleneck buffer overflows. Then, while A is slowly ramping up again, B would detect that there's no congestion and send more traffic. B would then gradually back off as A fills the queue again, and so on.

If this is right, then BBR would co-exist well enough with connections with loss-based algorithms.