
RaptorQ and performance optimization in Rust - raccoonone
https://www.cberner.com/2019/03/30/raptorq-rfc6330-rust-optimization/
======
lower
When comparing a Rust rewrite of a C++ codebase, I noticed that Rust's default
HashMap and HashSet were noticably slower than unordered_map and
unordered_set. The reason is that Rust uses a hash function that has better
properties, but that is a bit slower. If the performance of a HashMap is a
bottleneck, one can use a simpler hash function to get the same performance as
in C++. With FnvHashMap, the performance was the same as unordered_map.

~~~
v_lisivka
Try hashbrown. Author declares 8x improvement over standard HashMap.

[https://github.com/Amanieu/hashbrown](https://github.com/Amanieu/hashbrown)

[https://blog.waffles.space/2018/12/07/deep-dive-into-
hashbro...](https://blog.waffles.space/2018/12/07/deep-dive-into-hashbrown/)

[https://gankro.github.io/blah/hashbrown-
insert/](https://gankro.github.io/blah/hashbrown-insert/)

~~~
steveklabnik
It soon will be the standard hashmap!

~~~
littlestymaar
Does it still provide the DoS protection SipHash offered, or does that mean
that Rust is moving from secure-by-default to fastest-by-default when it comes
to HashMap ? If so it's kind of a big philosophical switch.

~~~
rapsey
Hashbrown is not about the hashing algorithm. It uses fx hash which is pretty
commonly used in Rust. I imagine they will still keep SipHash

~~~
littlestymaar
OK thanks. But that would mean the performance gain would not be as dramatic I
guess ?

~~~
Manishearth
It's still pretty major, because SIMD, but not as dramatic

[https://github.com/Amanieu/hashbrown#performance](https://github.com/Amanieu/hashbrown#performance)

Note that the hashbrown crate uses FxHash by default, so the second table is
the fair comparison. When ported to the stdlib it will likely be changed to
have SipHash as the default.

For the standard hashmap, replacing SipHash with FxHash gets you a 4x perf
boost. Replacing the standard hashmap with hashbrown gets you a 2x perf boost
(when you're using FxHash, there are no numbers here for when you're using
SipHash but I suspect it will still be significant, if smaller)

~~~
littlestymaar
Thanks a lot for the clarification!

------
Luker88
I have played with a C++ implementation of RFC6330, and while I have to say I
never went for performance, I find this benchmark a bit... pointless.

RaptorQ is kind of a gaussian elimination of a matrix, so it all depends on
the block size (=>matrix size). The algorithm has basically cubic complexity
on the number of symbols in a block. RFC6330 is made to work on files, which
are divided into blocks with a certain number of symbols, and the bytes are
interleaved.

This implementation does not do the (complex and almost pointless)
interleaving, which is fine, even OpenRQ does not.

The bench seems to be done on a.... 10kb file? It all fits in the L2. We are
not given the symbol size (which determines the block size!) and I assume all
of this fits in a 10x10 matrix.

You are benchmarking operations on a matrix that is (more or less) a 10x10
byte matrix.

The biggest part of this benchmark might almost be the generation of repair
symbols (was it even done?), since that would require multiple xoring of the
above-mentioned symbols.

This is much closer to micro-benchmarking than an actual benchmark, imho. It
would have been more interesting to see what happens with files at least
larger than the L3 cache.

You can also cache intermediate results, (which he does not do) which is
especially useful for encoding, but only when working on matrix >= 100x100,
otherwise just searching the cache, getting from memory (my implementation
optionally did LZ4 compression/decompression) and doing a matrix
multiplication is slower than just computing the matrix again.

Still, it's nice to see implementations of the RFC, which is a real pain to
read...

------
sudhirj
Does anyone know of a good (preferably pure) Go implementation?
[https://github.com/google/gofountain](https://github.com/google/gofountain)
has mad a good start and implemented the older plain Raptor standard, but I
haven't been able to find a good Go library.

If anyone is qualified enough to take a crack at it I can try and arrange for
sponsorship as well.

------
nullc
RaptorQ, even in this implementation, appears to be a lot slower than
wirehair:
[https://github.com/catid/wirehair](https://github.com/catid/wirehair)

~~~
Luker88
Thanks for the pointer, any other similar algorithms?

O(n) sounds impressive, although I heard such claims on RQ too...

RQ complexity is O(n^3), but on the internal matrix, not on the input. It all
depends on the blocksize you choose.

The numbers are impressive, too. Encoding 40Mbytes in 0usec? Will have to
check the theory behind that.

~~~
nullc
You could look at ldpc triangle/staircase. They're fast even in a fairly naive
implementation but have fairly high overhead unless most of your input is the
systematic part of the codeword.

------
mike00632
For anyone interested in how this code works, here is Qualcomm's technical
overview of RaptorQ: [https://www.qualcomm.com/media/documents/files/raptorq-
techn...](https://www.qualcomm.com/media/documents/files/raptorq-technical-
overview.pdf)

------
eridius
I don't know much about SIMD but are there no alignment concerns with
reinterpreting a slice as an AVX vector?

~~~
ajtulloch
Unaligned and aligned loads of AVX vectors have basically the same performance
since Ivy Bridge IIRC.

~~~
jnordwick
I was under the impression that unaligned ops ran at the same speed, but they
used up more register ports, so it does reduce memory bandwidth between the
register file and cache. Or does this no longer apply either?

~~~
a1369209993
My understanding is that the first unaligned load uses more register ports[0],
but a second (third, etc) contiguous load doesn't. IANA[intel
microarchitechure expert] though.

0: Or more memory bandwidth anyway.

------
layoutIfNeeded
What is the patent situation with Raptor codes nowadays?

~~~
sudhirj
We have a license and use them for satellite broadcasting, but my
understanding is Qualcomm has made statements in
[https://datatracker.ietf.org/ipr/1511/](https://datatracker.ietf.org/ipr/1511/)
that if you use it for a "wireless wide-area standard (for example, a UMTS-
compatible handset or Infrastructure equipment)" you'll be charged a standard
royalty fee, otherwise they don't care.

~~~
ignoramous
Noob question: Any reason folks would use FountainCode/RaptorQ over TurboCode
[0] / PolarCode [1]? What's the difference?

[0]
[https://en.wikipedia.org/wiki/Turbo_code](https://en.wikipedia.org/wiki/Turbo_code)

[1]
[https://en.wikipedia.org/wiki/Polar_code_(coding_theory)](https://en.wikipedia.org/wiki/Polar_code_\(coding_theory\))

~~~
krastanov
I do not think the "non-fountain" style of error correction has the nice
property of "just grab any reasonable fraction of the packets and you will get
to decode the whole message". With a Turbo/LDPC/Polar/etc code you encode a
packet, send it, and when it is received it is either decoded or not, but
there is no notion of a message spread redundantly over many packets, where
any n of them are enough for the decoding of the full message.

------
tgtweak
Very cool efforts here. Curious what the speed is without those instructions
present on the cpu, and what hardware this was run on.

The concept of a fountain seems interesting but what is a good use case?
Variable strength error correction?

~~~
sudhirj
Let's say you want to send something (say 100KB) to a million listeners, but
you don't know which of them is going to be listening when. You'd feed that
100KB into a RaptorQ encoder, configure it to 1KB packets, and it would give
you stream of near-infinite 1KB packets that you could broadcast (usually over
a satellite, but UDP, multicast, QR codes all work).

Receivers would listen for as many of these packets as they can, as and when
they can (the transmission can be "lossy"), and if a receiver pickups up 100
unique packets, any 100 unique packets in any order, it'll have a 99.9% (or
something along those lines) chance of decoding the message successfully. Each
extra unique packet adds a 9 to the chances.

That's why the "fountain". It's a data fountain, and you can grab any quantity
of data off it at any point and still have a good chance of reconstructing the
message.

~~~
tgtweak
Very interesting. Still struggling to see how this would be more efficient
than looping that data for the use case of getting beacon data out. Perhaps
it's more consistent/predictable in scenarios of high or targeted loss (ie
first X bytes keep getting dropped)?

~~~
sudhirj
If you’re transferring 5GB over 1KB packets over UDP or a satellite, the
benefits are very clear. Without a fountain code, there are 5,000,000 packets
you need to receive. Miss one packet, and your have to sit through another
4,999,999 packets and pray very hard that you get your packet reliably the
next time.

With a fountain code, it’s not a problem. If you miss one packet, you have a
99% chance of making do with the 5,000,001st packet that comes. 99.9% of
making do with the 5,000,002nd packet etc.

In other words, without a fountain code, if you want to reliably receive 1GB,
and you miss a packet, your satellite will have to transmit 2GB just for your
sake, which is expensive af. With a fountain code, it’ll only need to send
1.01GB.

------
minimax
Can anyone suggest a good book that would serve as an introduction to finite
field arithmetic? I keep randomly running into it (e.g. this post), but don't
understand it well enough to follow the discussion.

~~~
ignoramous
If you would prefer video, then UoCambridge's recording of "Lecture 5: Entropy
and Data Compression (IV): Shannon's Source Coding Theorem, Symbol Codes and
Arithmetic Coding" is up here:
[http://videolectures.net/mackay_course_05/](http://videolectures.net/mackay_course_05/)

In case you don't find what you're looking for, here are all lectures on
"Information Theory, Pattern Recognition, and Neural Networks":
[http://videolectures.net/course_information_theory_pattern_r...](http://videolectures.net/course_information_theory_pattern_recognition/)

I'd be remiss not to mention that the lectures are by the late David MacKay:
[https://news.ycombinator.com/item?id=11500221](https://news.ycombinator.com/item?id=11500221)

------
empath75
Anybody want to comment on whether all of his unsafe code was actually
necessary? Seems bad for rust that safe code is 25x slower.

~~~
staticassertion
SIMD requires unsafe for some reason. Not sure exactly why.

I don't see why they used unsafe in `add_assign` - an assert!(octets.len() ==
other.len()) would likely have elided the bounds checks.

~~~
raphlinus
I talk a little about why in my "fearless SIMD" blog post (also introducing a
prototype that wraps the unsafety but at the moment is very limited in exactly
what SIMD operations are exposed).
[https://raphlinus.github.io/rust/simd/2018/10/19/fearless-
si...](https://raphlinus.github.io/rust/simd/2018/10/19/fearless-simd.html)

