
Forward error correction in QUIC - jsnell
https://docs.google.com/document/d/1Hg1SaLEl6T4rEU9j-isovCo8VEjjnuCPTcLNJewj7Nk/edit?pref=2&pli=1
======
ycmbntrthrwaway
Similar idea (coded TCP) is patented:
[http://www.codeontechnologies.com/technology/use-cases/coded-tcp/](http://www.codeontechnologies.com/technology/use-cases/coded-tcp/)
They are selling licenses to Cisco:
[http://www.codeontechnologies.com/partners/](http://www.codeontechnologies.com/partners/)

The idea is simple: XOR random combinations of packets and use Gaussian
elimination to restore them. Google uses the simplest version of it: XOR
together groups of N packets and send them as N+1 packets, so one lost packet
per group can be recovered.
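
A minimal sketch of that N+1 XOR parity scheme (illustrative only; the names
are mine, and payloads are assumed equal-length, so a real implementation
would need padding and framing):

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def make_parity(packets: list[bytes]) -> bytes:
    """The extra (N+1)-th packet: the XOR of all N payloads."""
    parity = bytes(len(packets[0]))            # start from all zeros
    for p in packets:
        parity = xor_bytes(parity, p)
    return parity

def recover(received: list[bytes | None], parity: bytes) -> list[bytes]:
    """Repair at most one missing packet (marked None) using the parity."""
    missing = [i for i, p in enumerate(received) if p is None]
    if len(missing) > 1:
        raise ValueError("a single XOR parity can only repair one loss")
    if missing:
        repaired = parity
        for p in received:
            if p is not None:
                repaired = xor_bytes(repaired, p)
        received[missing[0]] = repaired
    return received

# Lose packet 1 of a 3-packet group, then repair it from the parity.
group = [b"pkt0", b"pkt1", b"pkt2"]
damaged = [group[0], None, group[2]]
assert recover(damaged, make_parity(group)) == group
```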

~~~
justicezyx
That sounds like network coding

~~~
colanderman
Ya, I don't understand how that's patented. It must be a very specific patent.
A few years ago, when I first learned of the woes of TCP over lossy links,
literally the first thought that popped into my head was "why not code N
packets into M>N and recover using Gaussian elimination over GF(2)?" Such an
idea could have passed for non-obvious maybe 50 years ago, but it's a pretty
basic idea in the 21st century.

To be clear, you can trivially apply _any_ erasure code
([https://en.wikipedia.org/wiki/Erasure_code](https://en.wikipedia.org/wiki/Erasure_code))
across packets in a channel.
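
For the record, a toy sketch of that "code N into M>N and solve over GF(2)"
idea, with packets modelled as small integers and coefficient vectors as
bitmasks (all names are mine; a real implementation would work on byte
buffers and account for the overhead of shipping the coefficients):

```python
import random

def encode(sources: list[int], m: int) -> list[tuple[int, int]]:
    """Code N source packets into M random XOR combinations over GF(2);
    each coded packet carries its coefficient bitmask."""
    n = len(sources)
    coded = []
    for _ in range(m):
        coeffs = random.getrandbits(n) or 1    # avoid the useless all-zero row
        payload = 0
        for i in range(n):
            if coeffs >> i & 1:
                payload ^= sources[i]
        coded.append((coeffs, payload))
    return coded

def decode(coded: list[tuple[int, int]], n: int) -> list[int] | None:
    """Gaussian elimination over GF(2): returns the N sources if the
    received coefficient vectors span GF(2)^N, otherwise None."""
    pivots: dict[int, list[int]] = {}
    for coeffs, payload in coded:
        row = [coeffs, payload]
        for col in range(n):
            if row[0] >> col & 1:
                if col in pivots:              # eliminate against known pivot
                    row[0] ^= pivots[col][0]
                    row[1] ^= pivots[col][1]
                else:                          # new pivot for this column
                    pivots[col] = row
                    break
    if len(pivots) < n:
        return None                            # rank deficient: need more packets
    for col in sorted(pivots, reverse=True):   # back-substitution
        for c, row in pivots.items():
            if c != col and row[0] >> col & 1:
                row[0] ^= pivots[col][0]
                row[1] ^= pivots[col][1]
    return [pivots[col][1] for col in range(n)]

# 3 sources coded into 5 packets; any 3 independent survivors decode.
random.seed(42)
sources = [0b1010, 0b0111, 0b1100]
coded = encode(sources, m=5)
survivors = coded[:4]                          # pretend one coded packet was lost
print(decode(survivors, n=3))                  # the sources, or None if unlucky
```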

------
Nrsolis
I've often wondered why not implement G.709 on the packet payload (as opposed
to the whole frame) and see how that compares to TCP retransmission.

[https://en.wikipedia.org/wiki/G.709](https://en.wikipedia.org/wiki/G.709)

~~~
ycmbntrthrwaway
I don't know about G.709, but in general coding, including FEC, is used at the
link layer, where it can be tuned finely enough to achieve throughput close to
channel capacity and a nearly zero packet loss ratio. That is the problem of
channel estimation and rate control. In an ideal world where users don't
contend and always transmit at the same rate, it would be enough.

TCP solves another problem: it adjusts the source rate. It assumes that the
link layer works correctly and that packets are not lost because an incorrect
rate (aka MCS) is used. Instead, packet loss is used as the indicator of
congestion: users sending data faster than the bottleneck can carry it.
Congestion is hard to predict, and packets are not lost with a stable
probability on the TCP timescale (one RTT, around 50 ms), so TCP can't react
in time and select a code that would correct losses without adding too much
overhead. Thus, coding is not used at the TCP layer.

That is why Google focuses on _latency_, not throughput. That is the area
where you can benefit from FEC at the transport layer. Another problem QUIC
could solve is bad rate control, such as Wi-Fi rate control on old Windows
systems, but as wireless internet usage moves to mobile, and Google controls
Android, it becomes less of a problem: Google can just improve Wi-Fi rate
control in Linux if it fails.

~~~
Nrsolis
Yeah. I understand the problem that TCP is trying to solve.

My thinking was that there is a whole class of network applications that don't
work great under TCP because the retransmission mechanism drives latency super
high when congestion is experienced in the network. TCP works to back off
transmission that might be the cause of packet loss, but those packets still
have to be transmitted again and the bandwidth timeslot consumed by the
dropped packet amounts to wasted resources in the network. In a datacenter
microburst situation, retransmissions just tickle the problem you're trying to
avoid.

FEC over a lossy network would allow for a slight increase in the baseline
resource utilization with the side benefit of eliminating retransmissions
which might be beneficial in a microburst environment. I don't really know,
I'm just thinking aloud.

~~~
raattgift
Consider three conditions:

- high link-layer integrity; packet loss is mainly due to congestion

- low packet loss; most flows unaffected

- bursty packet loss; individual flows, where they see any loss at all, have
a non-negligible probability of multiple back-to-back packets lost (e.g. in
FIFO queue tail drop, or a dynamic routing transient like a black hole or a
brief (~ TTL) forwarding loop).

Consider the limit in which the number of simultaneous flows becomes large. In
that limit, there is a lot of additional FEC data that has been generated,
successfully transported across a network, and ultimately discarded as
redundant. There is also FEC data that fails to help in the reconstruction of
flows which experience a multi-packet loss too great for the erasure code
being used.

We can calculate wasted power in this limit, and in some situations that waste
can be significant.
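
As a rough illustration of that waste (my numbers and assumptions, not from
the document): with the N+1 XOR scheme and an independent per-packet loss
probability p, we can ask how often a group's parity packet actually repairs
anything.

```python
def parity_outcomes(n: int, p: float) -> dict[str, float]:
    """Per-group probabilities: the parity is redundant (no data loss),
    useful (exactly one data packet lost and the parity survives), or
    unhelpful (parity lost too, or two or more data packets lost)."""
    redundant = (1 - p) ** n
    useful = n * p * (1 - p) ** (n - 1) * (1 - p)  # one loss, parity arrives
    return {
        "fec_overhead": 1 / (n + 1),   # share of sent bytes that are FEC
        "redundant": redundant,
        "useful": useful,
        "unhelpful": 1 - redundant - useful,
    }

# At N=10 and a 0.1% loss rate, ~10% extra bytes buy a repair in under 1%
# of groups; the parity is pure overhead roughly 99% of the time.
print(parity_outcomes(10, 0.001))
```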

One situation is when the FEC-generator is sending a large number of flows
that stay near steady-state equilibrium amongst themselves and with other
long-lived flows, which is reasonable to expect when multi-minute adaptive
streaming video is the dominant component of the traffic.

Conversely, for a transmission that is sufficiently short that it never enters
the moral equivalent of congestion avoidance, and where the application is
also latency-sensitive, small amounts of FEC in the early parts of
conversations might be a win over retransmissions, depending on how often the
FEC forms part of loss recovery.

In the modern Internet, the cost-benefit analysis of the extra power the extra
FEC requires is probably strongly time-dependent, _and_ it may be counter-
intuitive. (E.g., if FEC doubles initial traffic, it may exacerbate any
burstiness, which may in turn lead to greater burst synchronization; that is,
in a microburst environment the additional FEC data, unless it's somehow
congestion-avoiding (hard to do for very short flows), might be the opposite
of beneficial.)

It's probably also worth considering differences in FEC receivers, for
instance whether they are drawing power from a small battery or not. Consider
a toy question like, "What drains more battery energy: two hours of receiving
FEC or two and a bit hours of receiving non-FEC?" "A bit" may matter. It might
also be dynamically calculable: lots of retransmits? Transition to sending FEC
proactively. No FEC use? Transition to reactive packet loss recovery.
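
That dynamic transition could look something like this toy sketch (the class
name and thresholds are hypothetical, mine rather than anything from the
thread):

```python
class FecPolicy:
    """Toggle between proactive FEC and plain retransmission based on
    recent loss-recovery history (illustrative thresholds only)."""

    def __init__(self, enable_at: float = 0.02, disable_at: float = 0.001):
        self.enable_at = enable_at      # retransmit rate that turns FEC on
        self.disable_at = disable_at    # FEC-repair rate that turns it off
        self.proactive_fec = False

    def update(self, retransmit_rate: float, fec_repair_rate: float) -> bool:
        """Feed per-interval stats; returns whether to send FEC proactively."""
        if not self.proactive_fec and retransmit_rate > self.enable_at:
            self.proactive_fec = True   # lots of retransmits: pay FEC up front
        elif self.proactive_fec and fec_repair_rate < self.disable_at:
            self.proactive_fec = False  # FEC went unused: back to reactive
        return self.proactive_fec
```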

Anyway, it's cool to get even a little look at what Google's learned from its
experiments. Everyone should encourage the sharing of these sorts of details,
and we should probably encourage other large scale entities with different
types of traffic (e.g. non-YouTube Google; Facebook; Twitter) to do likewise.

------
muhic
It is worth noting that this document reads more like a first draft; in fact,
the document title is 'QUIC FEC v1'. No details or illustrations of the
implementation and experimental results are given; instead you find statements
like 'Median and mean join latency increased'.

~~~
jsnell
It's not academic writing, just a short explanation of why something that was
expected to be a major advantage of QUIC is being removed from the spec for
now, due to having a negative impact on YouTube. (Of course, that shows that
they're able to revert bad decisions, which is definitely an advantage over
TCP...)

Though there is an interesting rebuttal in the related mailing list thread [1]
suggesting that the feature is sound as-is, but was being applied in the wrong
way in these experiments. Specifically, that the largest expected gain is for
the client-sent packets very early during the connection setup.

[1]
[https://groups.google.com/a/chromium.org/forum/#!topic/proto-quic/Z5qKkk2XZe0](https://groups.google.com/a/chromium.org/forum/#!topic/proto-quic/Z5qKkk2XZe0)

------
kierank
Why not use something more advanced like LDPC instead of basic XOR (unless I'm
missing a trick)?

~~~
amock
I think the reason is that the computational power needed to use LDPC is much
greater than that required for XOR.

