
Employing QUIC Protocol to Optimize Uber’s App Performance - nhf
https://eng.uber.com/employing-quic-protocol/
======
ctime
This YouTube video does a great job illustrating how well HTTP/2 works in
practice.

[https://www.youtube.com/watch?v=QCEid2WCszM](https://www.youtube.com/watch?v=QCEid2WCszM)

A lesser-known downside to the HTTP/2-over-TCP solution was actually caused by one
of the improvements - a single reusable (multiplexed) connection - which could
end up stalled or blocked due to network issues. This behavior could go
unnoticed over legacy HTTP/1.1 connections because browsers open a huge
number of connections (~20) to a host, so when one failed it wouldn't
block everything.
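
A toy sketch of that stall, assuming nothing beyond TCP's in-order delivery (illustrative only, not any real HTTP/2 implementation):

    # TCP delivers bytes strictly in order, so one lost segment stalls
    # every HTTP/2 stream multiplexed over the same connection.
    arrived = {1: "stream A frame", 3: "stream B frame", 4: "stream C frame"}
    # segment 2 was lost in transit and has not been retransmitted yet

    deliverable, next_seq = [], 1
    while next_seq in arrived:
        deliverable.append(arrived[next_seq])
        next_seq += 1

    print(deliverable)  # ['stream A frame'] -- B and C are stuck behind segment 2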

~~~
youngtaff
It's an unrealistic test as pages aren't really made up of tiles of images

Generally images are the lowest-priority download, so ensuring higher-priority
items get downloaded first is important, and not all H2 implementations do it
well.

[https://ishttp2fastyet.com](https://ishttp2fastyet.com)

~~~
baroffoos
It's only unrealistic because so much tooling was built to avoid sending
multiple files: JS tools for bundling every JS file into one, sprite sheets,
using multiple domain names to get more concurrent connections. With HTTP/2 we
could dump so much of this.

~~~
youngtaff
To me it's unrealistic in the sense that it's an artificial test.

Images are the lowest-priority resource on the page, and apart from the visual
appearance aspects there are no dependencies on the order they're fetched
in.

Most other resources on a page have greater side effects and dependencies,
e.g. sync JS blocking the HTML parser, and sync and deferred JS needing to be
executed in order.

You can saturate a last-mile connection with image downloads in a way you may
not be able to with other resources, due to the effect of the browser processing
those resources.

------
internals
What a great case study. Successfully shifting 80% of mobile traffic to QUIC
for a 50% reduction in latency is amazing. QUIC and the ongoing work with
multipath TCP/QUIC will be huge QoL improvements for mobile networking.

~~~
api
The other awesome thing about QUIC is that it encrypts almost everything,
including header information, making middlebox traffic shaping worthless and
demoting middleboxes in general.

~~~
drewg123
It also makes hardware offloads like TSO and LRO impossible, and increases
cost-per-byte served by a factor of 4 or more. So if you have infinite CPU to
throw at QUIC and/or low bandwidth or connection targets, it's great. If you
are concerned at all about server-side efficiency, it's terrible.

FWIW, I work on the Netflix CDN and specialize in server-side efficiency; we
have had 100G flash CDN nodes for years serving at 90G+ in production. None of
that would be possible with QUIC as it stands. I suspect our max B/W on these
machines would drop from ~95Gb/s to 20Gb/s or less if we were to switch to
QUIC.

~~~
scott00
Does the protocol actually make it impossible, or is it just not implemented
by current OS/hardware?

~~~
vlovich123
Definitely just that the HW hasn't caught up.

[https://www.netdevconf.org/0x13/session.html?talk-quic-offload](https://www.netdevconf.org/0x13/session.html?talk-quic-offload)

------
panarky
Experiment 1

 _While we used the NGINX reverse proxy to terminate TCP, it was challenging
to find an openly available reverse proxy for QUIC. We built a QUIC reverse
proxy in-house using the core QUIC stack from Chromium and contributed the
proxy back to Chromium as open source._

Experiment 2

 _Once Google made QUIC available within Google Cloud Load Balancing, we
repeated the same experiment setup with one modification: instead of using
NGINX, we used the Google Cloud load balancers to terminate the TCP and QUIC
connections...

Since the Google Cloud load balancers terminate the TCP connection closer to
users and are well-tuned for performance, the resulting lower RTTs
significantly improved the TCP performance._

------
esaym
I recently moved and got internet with Spectrum. A 200/10 service, yet my
upload speeds were rarely above 5 Mbit/s. This was a business account with some
web and dev servers behind it. I didn't even try to call customer service...

With a little more testing using UDP, I could see I was getting very spotty
packet loss (<0.5%). I'd never tried changing the TCP congestion algorithm
before, but I knew random packet loss is normally interpreted as congestion and
hence causes a speed backoff.

I tried all of the ones available at the time, but the one that stood out, not
only in performance but also in simplicity, was TCP-Illinois[0]. The stats
provided by `ss -i` also seemed the most accurate with TCP-Illinois. I
force-enable it on every machine I come across now.

0: [https://en.wikipedia.org/wiki/TCP-Illinois](https://en.wikipedia.org/wiki/TCP-Illinois)
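
For reference, a minimal sketch of selecting the algorithm per socket on Linux (assumes the tcp_illinois kernel module is loaded; the `net.ipv4.tcp_congestion_control` sysctl sets it system-wide):

    import socket

    # Pick the congestion control algorithm for one socket (Linux only;
    # needs `modprobe tcp_illinois`, and non-root users can only select
    # algorithms listed in net.ipv4.tcp_allowed_congestion_control).
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, b"illinois")

    # Read back what the kernel actually applied.
    algo = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    print(algo.rstrip(b"\x00").decode())  # "illinois"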

~~~
nullwasamistake
I highly recommend BBR congestion control if your router supports it.

~~~
lmns
TCP congestion control is end-to-end. Routers don't need any support for it.

~~~
mirashii
This previous discussion will probably help here.
[https://news.ycombinator.com/item?id=14298576](https://news.ycombinator.com/item?id=14298576)

------
m3kw9
TCP was built for the internet of long ago; even though changes have been added
over time, the architecture of the protocol makes it hard to do anything
drastic. Because UDP is so simple, you can basically create a new protocol on
top of it, inside the payload, and emulate TCP if you wanted to.
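
A toy sketch of that idea - a stop-and-wait reliability layer carried entirely inside UDP payloads (hypothetical framing, far simpler than anything QUIC actually does):

    import socket

    # One sequence byte prefixes each datagram; the sender retransmits
    # until the receiver echoes that byte back as an ack. Real protocols
    # (QUIC included) add windows, congestion control, encryption, etc.

    def reliable_send(sock, addr, data, seq, timeout=0.5, retries=10):
        sock.settimeout(timeout)
        for _ in range(retries):
            sock.sendto(bytes([seq]) + data, addr)
            try:
                ack, _ = sock.recvfrom(1)
                if ack[0] == seq:
                    return True        # receiver confirmed this sequence number
            except socket.timeout:
                continue               # packet or ack lost: retransmit
        return False

    def reliable_recv(sock, expected_seq):
        while True:
            packet, addr = sock.recvfrom(65535)
            sock.sendto(packet[:1], addr)    # ack whatever arrived
            if packet[0] == expected_seq:
                return packet[1:]            # drop duplicates and strays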

------
sly010
I wish the mandated minimum MTUs of IP were just a bit bigger. Uber's traffic
must be so transactional that they could really just use individual UDP packets
for most messaging.

~~~
jacob019
IPv4 - 576 bytes

IPv6 - 1280 bytes
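
Subtracting headers gives the payload you can actually count on (assuming no IP options or extension headers):

    # Guaranteed-deliverable UDP payload after IP and UDP headers:
    print(576 - 20 - 8)    # IPv4: 548 bytes (20-byte IP header, 8-byte UDP header)
    print(1280 - 40 - 8)   # IPv6: 1232 bytes (40-byte IP header, 8-byte UDP header)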

~~~
mruts
That should be sufficient for Uber, I think.

~~~
sly010
>>> print len("Dear Uber, I need a ride from 26th and 6th Street, Brooklyn to
999 Broadway. I am on the south side of the road. It's 6:99PM UTC, I have 55%
battery, my public key is b82c9238e847b. I like Jazz. My phone is a Pixel 3, I
am using app version 4.22.4, but you probably already knew that. I am running
out of things to add, so here is some personally identifiable information...
and we still have plenty of bytes left")

Edit: formatting

~~~
heinrich5991

          File "<stdin>", line 1
            print len("Dear Uber, I need a ride from 26th and 6th Street, Brooklyn to 999 Broadway. I am on the south side of the road. It's 6:99PM UTC, I have 55% battery, my public key is b82c9238e847b. I like Jazz. My phone is a Pixel 3, I am using app version 4.22.4, but you probably already knew that. I am running out of things to add, so here is some personally identifiable information... and we still have plenty of bytes left")
                    ^
        SyntaxError: invalid syntax

------
7ewis
So is this essentially HTTP/3?

~~~
wmf
Yes. [https://tools.ietf.org/html/draft-ietf-quic-http-20](https://tools.ietf.org/html/draft-ietf-quic-http-20)

------
the8472
Isn't TLP[0] supposed to fix the largest cause (tail losses) of this issue? It
should result in retransmits far sooner than the 30 seconds they mention.

> Recently developed algorithms, such as BBR, model the network more
> accurately and optimize for latency. QUIC lets us enable BBR and update the
> algorithm as it evolves.

Again this is available for TCP in recent Linux kernels[1]. And it's
sender-side, so it should be unaffected by ancient Android devices.

Are they using ancient Linux kernels on their load balancers? Or are the
sysctl knobs for these features turned off in some distros?

[0] [https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6ba8a3b19e764b6a65e4030ab0999be50c291e6c](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6ba8a3b19e764b6a65e4030ab0999be50c291e6c)

[1] [https://kernelnewbies.org/Linux_4.9#BBR_TCP_congestion_control_algorithm](https://kernelnewbies.org/Linux_4.9#BBR_TCP_congestion_control_algorithm)
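
A quick way to check those knobs on a given box (Linux; reading /proc/sys requires no special privileges):

    # Show the active and available TCP congestion control algorithms.
    for knob in ("tcp_congestion_control", "tcp_available_congestion_control"):
        with open("/proc/sys/net/ipv4/" + knob) as f:
            print(knob, "=", f.read().strip())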

~~~
lossolo
BBR in the kernel works for TCP because it's used for TCP congestion control.
QUIC is built on top of UDP, and there is no congestion control for UDP itself;
you implement it in your application or in the protocol built on top of UDP.
From what I remember, QUIC implementations support BBR and Cubic as congestion
control mechanisms.

~~~
v5c6
Correct. QUIC also implements TLP. The main advantage is deployability, since
this is a user-space solution.

------
ssvss
I thought DDoS prevention was difficult with UDP compared to TCP. Is that no
longer the case? Does Cloudflare provide DDoS prevention for QUIC/UDP?

~~~
mirashii
UDP DDoSes are hard to defend against because they're volumetric in nature.
Generally, they rely on UDP's statelessness, poorly configured networks, and
various applications that respond with significantly more data than requested
over UDP. These types of DDoSes are effective against any service that's
running on TCP or UDP, and Cloudflare has protected against them all along.

------
jefftk
I'm surprised the "alternatives considered" section doesn't have a "write
something custom for core functionality using UDP" option. I would be curious
to read why they decided not to go that way, given their scale and the
potential gains from not using a general-purpose protocol.

(Something like: make the entire standard journey from opening the app to
requesting a car run over something custom, and then leave the rest of the app
using TCP.)

~~~
telotortium
I'm assuming their thought process went like this: "we want to optimize a
workflow which today goes over HTTP RPC/REST, on mobile networks where the
low-level retry behavior of TCP is suboptimal. QUIC already exists, and we
don't have to figure out our own solutions for firewalls, security, or HTTP
semantics. Also, our web apps use the same or similar API endpoints, so the
less effort spent writing new routing, monitoring, etc., the better. Oh, it
works really well. Awesome, we avoided 6-24 months of debugging a custom
protocol over mobile networks around the world that we don't control."

------
OrgNet
This kind of latency improvement only matters if they are planning to do
autopilot from the cloud? (That would be crazy, especially if they don't have a
fallback.)

~~~
eropple
Does it necessarily imply that, though? Applications that are snappy and
responsive tend to improve users' impression of the app and probably improve
user engagement (it is/was true for browsers when I worked in this realm, and I
have no reason to doubt that it's true for mobile today).

~~~
Phlarp
I think the prevailing wisdom is that engagement suffers more from latency on
mobile.

