
Achieving reliable UDP transmission at 10 Gb/s using BSD socket - pmoriarty
https://arxiv.org/abs/1706.00333
======
ignoramous
> (abstract) _Optimizations for throughput are: MTU, packet sizes, tuning
> Linux kernel parameters, thread affinity, core locality and efficient
> timers._

Cloudflare's u/majke shared a series of articles on a similar topic [0][1][2]
(with a focus on achieving line rate with higher packets-per-second and lower
latency rather than throughput) that I found super helpful, especially since
they are so very thorough [3].

Speaking of throughput, u/drewg123 wrote an article on how Netflix does
100 Gb/s _with_ FreeBSD's network stack [4], and here's the BBC on how they do
so by _bypassing_ Linux's network stack [5].

---

[0] [https://news.ycombinator.com/item?id=10763323](https://news.ycombinator.com/item?id=10763323)

[1] [https://news.ycombinator.com/item?id=12404137](https://news.ycombinator.com/item?id=12404137)

[2] [https://news.ycombinator.com/item?id=17063816](https://news.ycombinator.com/item?id=17063816)

[3] [https://news.ycombinator.com/item?id=12408672](https://news.ycombinator.com/item?id=12408672)

[4] [https://news.ycombinator.com/item?id=15367421](https://news.ycombinator.com/item?id=15367421)

[5] [https://news.ycombinator.com/item?id=16986100](https://news.ycombinator.com/item?id=16986100)

------
snisarenko
Optimizing UDP transmission over the internet is an interesting topic.

I remember reading a paper a while ago showing that if you send two
consecutive UDP packets with the exact same data over the internet, at least
one of them arrives at the destination with a pretty high success rate
(something like 99.99%).

I wonder if this still works with current internet infrastructure, and if this
trick is still used in real-time streaming protocols.
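
The trick, as I remember it, was literally this (a minimal sketch; socket and
address setup omitted, and the receiver is assumed to drop the duplicate,
e.g. by sequence number):

    #include <stddef.h>
    #include <sys/socket.h>

    /* Duplicate-send trick: fire every datagram twice and let the
       receiver keep whichever copy arrives first. */
    void send_twice(int fd, const void *buf, size_t len,
                    const struct sockaddr *dst, socklen_t dlen)
    {
        sendto(fd, buf, len, 0, dst, dlen);
        sendto(fd, buf, len, 0, dst, dlen);  /* identical copy */
    }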

~~~
zamadatix
99.99% for two tries would be a 1% drop chance, which I'd say is pretty
lenient - we average better than that on our sites running off 4G (jitter is
horrible though, and that will kill any real-time protocol without huge
delays added).

Generally, though, you'd just implement a more generic FEC algorithm, unless
you had 2 separate paths you wanted to try (e.g. race a cable modem and 4G
with every packet, and if one side drops it, hope the other side still
finishes the race). There are FEC options that allow non-integer redundancy
levels and can reduce header overhead compared to sending multiple copies of
small packets.
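
A minimal sketch of the cheapest such option, XOR parity (hypothetical
framing, not any particular standard): for every 4 equal-size payloads send a
5th packet that is their XOR, a non-integer redundancy level of 1.25 that
survives any single loss per block:

    #include <stddef.h>
    #include <stdint.h>

    #define BLOCK   4
    #define PAYLOAD 1024

    /* Sender side: parity[i] = d0[i] ^ d1[i] ^ d2[i] ^ d3[i]. */
    void make_parity(const uint8_t data[BLOCK][PAYLOAD], uint8_t parity[PAYLOAD])
    {
        for (size_t i = 0; i < PAYLOAD; i++) {
            parity[i] = 0;
            for (size_t k = 0; k < BLOCK; k++)
                parity[i] ^= data[k][i];
        }
    }

    /* Receiver side: rebuild the single missing payload by XOR-ing the
       three payloads that did arrive with the parity packet. */
    void rebuild(const uint8_t present[BLOCK - 1][PAYLOAD],
                 const uint8_t parity[PAYLOAD], uint8_t missing[PAYLOAD])
    {
        for (size_t i = 0; i < PAYLOAD; i++) {
            missing[i] = parity[i];
            for (size_t k = 0; k < BLOCK - 1; k++)
                missing[i] ^= present[k][i];
        }
    }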

~~~
syrrim
>99.99% for two tries would be a 1% drop chance

Not necessarily. Drops of consecutive packets are likely correlated, such
that if you know the first one was dropped, you should increase your prior
that the second one will also be dropped.

~~~
zamadatix
Depends on the cause and the root question. For instance, in the most common
scenario of congestion, routers do intelligent random drops with probability
increasing as the buffer fills
([https://en.wikipedia.org/wiki/Random_early_detection](https://en.wikipedia.org/wiki/Random_early_detection)).
The internet actually relies on this low random drop chance to make things
work smoothly, rather than waiting until things are falling apart and then
signalling all streams at once to slow down while it catches up. The same
randomness shows up with transmission bit errors, which also cause drops, but
there the randomness is not by design so much as a product of the noise
causing them.
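
The ramp itself is simple enough to sketch (thresholds and max probability
here are made-up numbers; per the linked article, real RED also runs on an
EWMA of the queue depth plus a count correction):

    #include <stdbool.h>
    #include <stdlib.h>

    #define MIN_TH 50.0   /* avg queue depth (pkts) where dropping starts */
    #define MAX_TH 150.0  /* avg queue depth where everything is dropped  */
    #define MAX_P  0.02   /* drop probability as the queue nears MAX_TH   */

    /* Drop probability ramps linearly from 0 at MIN_TH to MAX_P at MAX_TH. */
    bool red_should_drop(double avg_queue_depth)
    {
        if (avg_queue_depth < MIN_TH)
            return false;
        if (avg_queue_depth >= MAX_TH)
            return true;
        double p = MAX_P * (avg_queue_depth - MIN_TH) / (MAX_TH - MIN_TH);
        return (double)rand() / RAND_MAX < p;
    }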

On the other hand, if the root question is about an outage-style issue, then
yeah: if the path to the destination is hard down, no number of duplicate
packets is going to help, because they are all going to drop. Likewise, if
the question is "on a short enough time scale, is the reliability of
delivering a single packet somewhere on the internet ever less than 99%?"
then yeah, somewhere there is a failure scenario, and on a short enough time
scale any failure scenario can be made to show 0% reliability.

------
exdsq
Can you do something similar with TCP and increase the packet size such that
the "TCP overhead" is reduced compared to 64-byte payloads, while keeping
TCP's reliability advantage over UDP?
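
Back of the envelope for the overhead part, counting Ethernet + IP + TCP
headers at ~58 bytes per packet (more if you count preamble and interpacket
gap):

    overhead = headers / (headers + payload)

      64-byte payload:  58 / (58 + 64)   ~ 48%
    8960-byte payload:  58 / (58 + 8960) ~ 0.6%   (MTU 9000 minus 40 bytes IP+TCP)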

~~~
toast0
In the system proposed, not really.

To use TCP instead of UDP, there are two big problems:

1) the sensor device would need to keep unacknowledged data in memory, but it
may not have enough memory for that

2) if they're running at line rate (max bandwidth, in this case) with UDP,
there's no bandwidth left to retransmit data

All of the buffer manipulation is going to be more CPU intensive on both
sides as well, and you'd run into congestion control (slow start) limiting
the data rate in the early part of the capture.

For a system like this, while UDP doesn't guarantee reliability, careful
network setup (either sensor direct to recorder, or a dedicated network with
sufficient capacity and no outside traffic) in combination with careful
software setup allows for a very low probability of lost packets despite no
ability to retransmit.
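
By "careful software setup" think along the lines of the receiver asking for
a much bigger socket buffer than the default, so short bursts don't overflow
the socket queue while the app is busy. A sketch (the 12 MiB figure is
arbitrary; the kernel silently caps the request at net.core.rmem_max):

    #include <stdio.h>
    #include <sys/socket.h>

    void grow_rcvbuf(int fd)
    {
        int want = 12 * 1024 * 1024;  /* 12 MiB */
        setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &want, sizeof(want));

        /* Read back what was actually granted; Linux reports double
           the requested value to account for kernel bookkeeping. */
        int got;
        socklen_t len = sizeof(got);
        getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &got, &len);
        printf("receive buffer: %d bytes\n", got);
    }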

------
zamadatix
"In a readout system such as ours the network only consists of a data sender
and a data receiver with an optional switch connecting them. Thus the only
places where congestion occurs are at the sender or receiver. The readout
system will typically produce data at near constant rates during measurements
so congestion at the receiver will result in reduced data rates by the
transmitter when using TCP."

At that point a better paper title would have been "Increasing buffers or
optimizing application syscalls to receive 10 Gb/s of data", as it has
nothing to do with achieving reliable UDP transmission - which it doesn't
even seem they needed:

"For some detector readout it is not even evident that guaranteed delivery is
necessary. In one detector prototype we discarded around 24% of the data due
to threshold suppression, so spending extra time making an occasional
retransmission may not be worth the added complexity"

As far as actual reliable UDP testing at high speeds goes, one might also
want to consider the test scenario, as not all Ethernet connections are
equal. The 2-meter passive DACs used here probably achieve a ~10^-18 bit
error rate (BER), or 1 bit error in every ~100 petabytes transferred. Go
optical, on the other hand, and even with forward error correction (FEC) it's
not uncommon to expect transmission loss in the real world. E.g. something a
little more current
[https://blogs.cisco.com/sp/transforming-enterprise-applicati...](https://blogs.cisco.com/sp/transforming-enterprise-applications-with-25g-ethernet-smf)
is happy to call 10^-12 with FEC "traditionally considered to be 'error
free'", a rate that would likely have produced lost packets even in this 400
GB transfer test (though again, they were fine with up to 24% loss in some
cases, so I don't think they were as worried about "reliable" as the paper
title would suggest).
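
A quick back-of-envelope check on that (my numbers, not the paper's):

    #include <stdio.h>

    int main(void)
    {
        double bits = 400e9 * 8;  /* the 400 GB transfer, in bits */
        printf("expected bit errors at 1e-12 BER: %.1f\n", 1e-12 * bits);  /* ~3.2    */
        printf("expected bit errors at 1e-18 BER: %.1e\n", 1e-18 * bits);  /* ~3.2e-6 */
        return 0;
    }

So on the "error free" optical link you'd expect a few corrupted (and
therefore dropped) packets over the course of the test, while on the short
DAC you'd expect essentially none.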

Generally, if you have any of these: 1) unknown congestion, 2) unknown speed,
or 3) unknown tolerance for error, you'll have to do something that eats CPU
time and massive amounts of buffers for reliability. If you need the best
reliability you can get but don't have the luxury of retransmitting for
whatever reason, then as much error correction in the upper-level protocol as
you can afford from a CPU perspective is your best bet.

If you want to see a modern take on achieving reliable transmission over UDP,
check out HTTP/3.

~~~
ignoramous
> _Generally if you have any of these: 1) unknown congestion 2) unknown speed
> 3) unknown tolerance for error_

> ... _If you want to see a modern take on achieving reliable transmission
> over UDP, check out HTTP/3._

Not an expert, but I have seen folks here complain that QUIC / HTTP/3 doesn't
have proper congestion control the way uTP (BitTorrent over UDP) does with
LEDBAT:
[https://news.ycombinator.com/item?id=10546651](https://news.ycombinator.com/item?id=10546651)

~~~
wmf
LEDBAT-style congestion control is not appropriate for "foreground" Web
traffic, and it would result in lower performance than TCP-based HTTP. Fixing
bufferbloat is an ongoing project, and it isn't fair to blame QUIC for being
no worse than TCP.

------
rubatuga
TLDR:

    sysctl -w net.core.rmem_max=12582912        # max socket receive buffer (12 MiB)
    sysctl -w net.core.wmem_max=12582912        # max socket send buffer (12 MiB)
    sysctl -w net.core.netdev_max_backlog=5000  # NIC-to-stack input queue length
    ifconfig eno49 mtu 9000 txqueuelen 10000 up # jumbo frames, longer TX queue
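
(Worth noting: these sysctls only raise the system-wide caps; the application
still has to ask for the bigger buffers with setsockopt(SO_RCVBUF) /
setsockopt(SO_SNDBUF), or nothing changes.)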

------
mynegation
Relevant discussion on HN from 4 months ago of IBM’s proprietary large data
transfer tool:
[https://news.ycombinator.com/item?id=21898072](https://news.ycombinator.com/item?id=21898072)

------
Matthias247
Reading through the paper I can't see what the authors mean by "reliable
transmission" there, or how they achieve it.

I only see them referencing having increased socket buffers, which then led -
in combination with the available (and non-congested) network bandwidth and
their app's sending behavior - to no transmission errors. As soon as you
change any of those parameters it seems like the system would break down, and
they have absolutely no measures in place to "make it reliable".

The right answer still seems to be: implement a congestion controller,
retransmits, etc. - which essentially ends up reimplementing
TCP/SCTP/QUIC/etc.
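
For a sense of what the smallest version of "retransmits" looks like, here's
a stop-and-wait sketch over a connected UDP socket (hypothetical framing: a
4-byte sequence number the receiver echoes back as the ACK; real protocols
pipeline many packets in flight and add congestion control on top):

    #include <poll.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>

    int send_reliably(int fd, uint32_t seq, const void *data, size_t len)
    {
        uint8_t pkt[4 + 1400];
        if (len > 1400)
            return -1;
        memcpy(pkt, &seq, 4);
        memcpy(pkt + 4, data, len);

        for (int attempt = 0; attempt < 5; attempt++) {
            send(fd, pkt, 4 + len, 0);

            struct pollfd pfd = { .fd = fd, .events = POLLIN };
            if (poll(&pfd, 1, 200) > 0) {  /* wait up to 200 ms for the ACK */
                uint32_t ack;
                if (recv(fd, &ack, 4, 0) == 4 && ack == seq)
                    return 0;              /* acknowledged */
            }
            /* timeout or stray packet: fall through and retransmit */
        }
        return -1;                         /* gave up; path may be hard down */
    }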

~~~
rubatuga
They want reliable UDP, not TCP. They state that very clearly.

~~~
zamadatix
Yes, but they didn't do anything to make UDP reliable; they just said "in our
test scenario we didn't notice any loss at the application layer after
increasing the socket receive buffer" and called it a day, because elsewhere
in the paper they noted: "For some detector readout it is not even evident
that guaranteed delivery is necessary. In one detector prototype we discarded
around 24% of the data due to threshold suppression, so spending extra time
making an occasional retransmission may not be worth the added complexity."

I think the paper meant "reliable" in a different way than most would take it
to mean in a paper about networking. It's similar to if someone wrote a paper
on "Achieving an asynchronous database for timekeeping" and spent a lot of
time talking about databases, but it turned out that by "asynchronous" they
meant you could enter your hours at the end of the week rather than the
moment you walked in or out of the door.

~~~
touisteur
I just think they meant reliable in a 'how to dimension the system to greatly
reduce possible loss' sense. No protocol is 'fully' reliable in all
dimensions (latency, message loss, throughput). Sometimes you benchmark your
exact physical configuration and add large margins, add some packet-loss
detection mechanisms, possibly retries (but if your latency requirements are
hard, no dice), or duplicate the physical layer (oh god, de-duplication at
10GbE...), or just accept some losses.

I just meant 'reliable is a spectrum'...

~~~
p1necone
Reliability in the context of networking protocols means a specific thing to
me - guaranteeing packet delivery (to the extent that it is physically
possible of course).

This does seem to be a technical term with a defined meaning that matches my
assumption too:
[https://en.wikipedia.org/wiki/Reliability_(computer_networki...](https://en.wikipedia.org/wiki/Reliability_\(computer_networking\))

~~~
greglindahl
Try considering what the authors of the paper mean.

~~~
p1necone
If the authors of the paper are using a term that already has a specific
meaning in the area they are working in, but mean something different by it,
then they are making a mistake.

------
bogomipz
From the abstract:

>"In addition UDP also supported on a variety of small hardware platforms such
as Digital Signal Processors (DSP) Field Programmable Gate Arrays (FPGA)"

I am curious: what would be the use case for implementing a network stack and
using UDP directly on a DSP chip? Perhaps I have a very narrow understanding
of DSPs.

~~~
touisteur
Well you might have to use a DSP to get the signal from your ADC to your PC
for signal processing. You might find 8-core DSPs with built-in 10GbE
capabilities easier to program than a 10GbE IP on a FPGA...

------
a_t48
I wish I had seen this at my last job. This is something I had to set up and
it was painful - lots of trial and error.

------
otterley
(2017)

------
fulafel
This would be interesting to try at today's faster Ethernet speeds; I wonder
how it goes at 100G.

