RaptorQ and performance optimization in Rust (cberner.com)
127 points by raccoonone on March 30, 2019 | 48 comments



When comparing a Rust rewrite of a C++ codebase against the original, I noticed that Rust's default HashMap and HashSet were noticeably slower than unordered_map and unordered_set. The reason is that Rust uses a hash function with better properties (it's DoS-resistant), but that is a bit slower. If a HashMap is a performance bottleneck, you can swap in a simpler hash function to get the same performance as in C++. With FnvHashMap, the performance matched unordered_map.
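(For concreteness, here is a minimal sketch of that swap, assuming the `fnv` crate; the keys are made up for illustration:)

```rust
use fnv::FnvHashMap; // from the `fnv` crate
use std::collections::HashMap;

fn main() {
    // Default HashMap: SipHash-based, DoS-resistant, but slower per hash.
    let mut default_map: HashMap<u32, u64> = HashMap::new();
    default_map.insert(1, 10);

    // Drop-in replacement using the much cheaper FNV hash; fine when the
    // keys are small and not attacker-controlled.
    let mut fnv_map: FnvHashMap<u32, u64> = FnvHashMap::default();
    fnv_map.insert(1, 10);

    assert_eq!(default_map[&1], fnv_map[&1]);
}
```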



Hashbrown will soon be the standard hashmap!


Does it still provide the DoS protection SipHash offered, or does that mean that Rust is moving from secure-by-default to fastest-by-default when it comes to HashMap? If so, it's kind of a big philosophical switch.


Hashbrown is not about the hashing algorithm. It uses FxHash, which is pretty commonly used in Rust. I imagine they will still keep SipHash.


OK, thanks. But that would mean the performance gain would not be as dramatic, I guess?


It's still pretty major, because SIMD, but not as dramatic

https://github.com/Amanieu/hashbrown#performance

Note that the hashbrown crate uses FxHash by default, so the second table is the fair comparison. When ported to the stdlib it will likely be changed to have SipHash as the default.

For the standard HashMap, replacing SipHash with FxHash gets you a 4x perf boost. Replacing the standard HashMap with hashbrown gets you a 2x perf boost (when you're using FxHash; there are no numbers here for SipHash, but I suspect the gain will still be significant, if smaller).
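(A sketch of what those combinations look like in code, assuming hashbrown's std-compatible API and its then-default Fx-style hasher; `RandomState` is std's SipHash-based hash builder, plugged in to keep DoS resistance while still getting the SIMD table layout:)

```rust
use std::collections::hash_map::RandomState;

// hashbrown with its default (Fx-style) hasher: fastest.
type FastMap<K, V> = hashbrown::HashMap<K, V>;

// hashbrown with std's SipHash-based RandomState: keeps DoS resistance,
// still benefits from the SIMD probing and better memory layout.
type SipMap<K, V> = hashbrown::HashMap<K, V, RandomState>;

fn main() {
    let mut fast: FastMap<u32, &str> = FastMap::default();
    fast.insert(1, "one");

    let mut sip: SipMap<u32, &str> = SipMap::with_hasher(RandomState::new());
    sip.insert(2, "two");

    assert_eq!(fast[&1], "one");
    assert_eq!(sip[&2], "two");
}
```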


Thanks a lot for the clarification!


Thanks! I just replaced FnvHashMap with hashbrown::HashMap and got a 20% overall performance improvement in my workload (which is rather HashMap-heavy).


Slightly tangential, but C++ hash tables aren't even the fastest. The standard's requirements on iterator invalidation force the use of extra allocations and indirections for separate chaining.

I've hand-rolled and/or reused hash tables that use various probing methods (removing both allocations and extra indirections) and the performance improvement is non-trivial (for my workloads anyhow).


I've been surprised by how slow the C++ standard data structures are (likely due to the need to support a lot of functionality). I once wrote a basic heap in C and benchmarked it at -O3 against the C++ STL heap. I was expecting mine to be slower than the highly optimized STL heap, and was just wondering how much slower. Instead, it turned out to be 1.5-2x as fast (and this was despite freeing memory when the underlying vector shrank - something the C++ vector does not do).


I have a C++ implementation of RaptorQ.

Never went for performance, but at most I use `std::map<int, pointer>` (which is an RB-tree).

You can implement RQ without a hashmap.

It is only used to track the symbol number (uint32), so introducing hashing sounds a bit wasteful. You could also use a plain vector, actually, but since the symbol numbers come from the network, that might result in a lot of reordering or pointless allocation if you are not really careful.

--edit: typos
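(A rough sketch of that symbol-tracking approach in Rust terms, with hypothetical names; `BTreeMap` is the ordered-map analogue of `std::map`:)

```rust
use std::collections::BTreeMap;

// Hypothetical store for received symbols, keyed by symbol number.
// An ordered map avoids hashing a plain u32 and tolerates out-of-order
// arrival, unlike a Vec indexed by symbol number, which would need
// careful pre-allocation when symbols arrive in arbitrary order.
struct ReceivedSymbols {
    symbols: BTreeMap<u32, Vec<u8>>,
}

impl ReceivedSymbols {
    fn new() -> Self {
        ReceivedSymbols { symbols: BTreeMap::new() }
    }

    fn insert(&mut self, symbol_id: u32, data: Vec<u8>) {
        self.symbols.insert(symbol_id, data);
    }

    fn ready_to_decode(&self, symbols_needed: usize) -> bool {
        self.symbols.len() >= symbols_needed
    }
}
```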


I have played with a C++ implementation of RFC6330, and while I have to say I never went for performance, I find this benchmark a bit... pointless.

RaptorQ is essentially a Gaussian elimination of a matrix, so it all depends on the block size (=> matrix size). The algorithm has basically cubic complexity in the number of symbols in a block. RFC 6330 is made to work on files, which are divided into blocks with a certain number of symbols, and the bytes are interleaved.

This implementation does not do the (complex and almost pointless) interleaving, which is fine, even OpenRQ does not.

The bench seems to be done on a... 10KB file? It all fits in the L2 cache. We are not given the symbol size (which determines the block size!), and I assume all of this fits in a 10x10 matrix.

You are benchmarking operations on a matrix that is (more or less) a 10x10 byte matrix.

The biggest part of this benchmark might almost be the generation of repair symbols (was it even done?), since that would require multiple xoring of the above-mentioned symbols.
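(For readers unfamiliar with the octet math being benchmarked: in GF(256), addition is XOR and multiplication is carry-less multiplication reduced modulo an irreducible degree-8 polynomial - here x^8 + x^4 + x^3 + x^2 + 1, which I believe is the one RFC 6330 specifies. A minimal shift-and-add sketch, not the table-driven version real implementations use:)

```rust
/// GF(256) addition: plain XOR (each octet is a polynomial over GF(2)).
fn gf_add(a: u8, b: u8) -> u8 {
    a ^ b
}

/// GF(256) multiplication via shift-and-add, reducing modulo the degree-8
/// polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D). Real implementations use
/// precomputed log/exp tables instead of this loop.
fn gf_mul(mut a: u8, mut b: u8) -> u8 {
    let mut product = 0u8;
    for _ in 0..8 {
        if b & 1 != 0 {
            product ^= a;
        }
        let overflow = a & 0x80 != 0;
        a <<= 1;
        if overflow {
            a ^= 0x1D; // x^8 ≡ x^4 + x^3 + x^2 + 1 (low byte of 0x11D)
        }
        b >>= 1;
    }
    product
}

fn main() {
    assert_eq!(gf_add(2, 3), 1); // addition is just XOR
    assert_eq!(gf_mul(2, 3), 6);
}
```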

This is much closer to micro-benchmarking than an actual benchmark, imho. It would have been more interesting to see what happens with files at least larger than the L3 cache.

You can also cache intermediate results (which he does not do), which is especially useful for encoding, but only when working on matrices >= 100x100; otherwise, just searching the cache, fetching from memory (my implementation optionally did LZ4 compression/decompression), and doing a matrix multiplication is slower than just recomputing the matrix.

Still, it's nice to see implementations of the RFC, which is a real pain to read...


Does anyone know of a good (preferably pure) Go implementation? https://github.com/google/gofountain has made a good start and implemented the older plain Raptor standard, but I haven't been able to find a good Go library.

If anyone is qualified enough to take a crack at it, I can try to arrange sponsorship as well.


RaptorQ, even in this implementation, appears to be a lot slower than wirehair: https://github.com/catid/wirehair


Thanks for the pointer; any other similar algorithms?

O(n) sounds impressive, although I've heard such claims about RQ too...

RQ complexity is O(n^3), but in the size of the internal matrix, not the input. It all depends on the block size you choose.

The numbers are impressive, too. Encoding 40 MB in 0 usec? I'll have to check the theory behind that.


You could look at LDPC triangle/staircase codes. They're fast even in a fairly naive implementation, but they have fairly high overhead unless most of your input is the systematic part of the codeword.


For anyone interested in how this code works, here is Qualcomm's technical overview of RaptorQ: https://www.qualcomm.com/media/documents/files/raptorq-techn...


I don't know much about SIMD, but are there no alignment concerns with reinterpreting a slice as an AVX vector?


Unaligned and aligned loads of AVX vectors have basically the same performance since Ivy Bridge IIRC.
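(Concretely, the intrinsics typically used on byte slices are the unaligned-load variants, so no alignment guarantee on the slice is required; a sketch of a 32-byte XOR, assuming x86_64 with AVX2 and slices of at least 32 bytes:)

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn xor_32_bytes(dst: &mut [u8], src: &[u8]) {
    use std::arch::x86_64::*;
    debug_assert!(dst.len() >= 32 && src.len() >= 32);
    // loadu/storeu are the *unaligned* load/store intrinsics, so the slices
    // can start at any address; the aligned variants are load/store.
    let a = _mm256_loadu_si256(dst.as_ptr() as *const __m256i);
    let b = _mm256_loadu_si256(src.as_ptr() as *const __m256i);
    _mm256_storeu_si256(dst.as_mut_ptr() as *mut __m256i, _mm256_xor_si256(a, b));
}
```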


I was under the impression that unaligned ops ran at the same speed, but they used up more register ports, so it does reduce memory bandwidth between the register file and cache. Or does this no longer apply either?


My understanding is that the first unaligned load uses more register ports[0], but a second (third, etc.) contiguous load doesn't. IANA[Intel microarchitecture expert], though.

0: Or more memory bandwidth anyway.


Very cool efforts here. Curious what the speed is without those instructions present on the CPU, and what hardware this was run on.

The concept of a fountain code seems interesting, but what is a good use case? Variable-strength error correction?


Let's say you want to send something (say, 100KB) to a million listeners, but you don't know which of them will be listening when. You'd feed that 100KB into a RaptorQ encoder, configure it for 1KB packets, and it would give you a near-infinite stream of 1KB packets that you could broadcast (usually over satellite, but UDP, multicast, and QR codes all work).

Receivers would listen for as many of these packets as they can, as and when they can (the transmission can be "lossy"), and if a receiver picks up 100 unique packets (any 100 unique packets, in any order), it'll have a 99.9% (or something along those lines) chance of decoding the message successfully. Each extra unique packet adds a 9 to the chances.

That's why it's called a "fountain": it's a data fountain, and you can grab any quantity of data off it at any point and still have a good chance of reconstructing the message.
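(Roughly what that looks like with the raptorq crate the article is about; I'm going from memory of its Encoder/Decoder API, so treat the exact signatures as approximate:)

```rust
use raptorq::{Decoder, Encoder, EncodingPacket};

fn main() {
    // Stand-in for the 100KB message above.
    let data: Vec<u8> = (0..100 * 1024).map(|i| (i % 256) as u8).collect();

    // Split into ~1KB packets and generate some extra repair packets.
    let encoder = Encoder::with_defaults(&data, 1024);
    let mut packets: Vec<EncodingPacket> = encoder.get_encoded_packets(20);

    // Simulate a lossy, reordered broadcast: drop some packets, reverse the rest.
    packets.truncate(packets.len() - 10);
    packets.reverse();

    // Any sufficiently large subset, in any order, should be enough to decode.
    let mut decoder = Decoder::new(encoder.get_config());
    let mut result = None;
    for packet in packets {
        result = decoder.decode(packet);
        if result.is_some() {
            break;
        }
    }
    assert_eq!(result.unwrap(), data);
}
```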


Very interesting. Still struggling to see how this would be more efficient than just looping that data, for the use case of getting beacon data out. Perhaps it's more consistent/predictable in scenarios of high or targeted loss (i.e. the first X bytes keep getting dropped)?


If you're transferring 5GB in 1KB packets over UDP or a satellite link, the benefits are very clear. Without a fountain code, there are 5,000,000 packets you need to receive. Miss one packet, and you have to sit through another 4,999,999 packets and pray very hard that you get your packet reliably the next time around.

With a fountain code, it's not a problem. If you miss one packet, you have a 99% chance of making do with the 5,000,001st packet that comes, a 99.9% chance with the 5,000,002nd packet, etc.

In other words, without a fountain code, if you want to reliably receive 1GB, and you miss a packet, your satellite will have to transmit 2GB just for your sake, which is expensive af. With a fountain code, it’ll only need to send 1.01GB.


It vastly reduces the chance of you receiving a packet that contains no new information. If you just rebroadcast the same packets over and over again, a receiver that has 999/1000 packets may have a very low chance of serendipitously receiving the packet they need.


Not an expert, but if you miss a packet with the looping approach you have to wait for that particular packet to come around again, whereas with the fountain it sounds like you can recover it sooner.


Multicast reliable transmission that scales well to many many clients.


Can anyone suggest a good book that would serve as an introduction to finite field arithmetic? I keep randomly running into it (e.g. this post), but don't understand it well enough to follow the discussion.


If you would prefer video, then UoCambridge's recording of "Lecture 5: Entropy and Data Compression (IV): Shannon's Source Coding Theorem, Symbol Codes and Arithmetic Coding" is up here: http://videolectures.net/mackay_course_05/

In case you don't find what you're looking for, here are all lectures on "Information Theory, Pattern Recognition, and Neural Networks": http://videolectures.net/course_information_theory_pattern_r...

I'd be remiss not to mention that the lectures are by the late David MacKay: https://news.ycombinator.com/item?id=11500221


Huffman: Fundamentals of Error-Correcting Codes

ISBN 9780521782807


Anybody want to comment on whether all of his unsafe code was actually necessary? Seems bad for Rust that safe code is 25x slower.


> Seems bad for Rust that safe code is 25x slower.

Most of the optimizations listed have nothing to do with unsafe; they are algorithmic changes that you could apply in any language, and to safe code just as well as to unsafe code. Even if we pretend all the unsafe used there is necessary, and interpret the post in the worst possible light for Rust - that the optimization steps would be impossible without unsafe - I think we'd only be talking 4x to 5x slower based on this article, not 25x.

But - I strongly suspect not all that unsafe is necessary, and the post certainly doesn't claim it is.

The first code snippet uses unsafe to skip bounds checks. There are plenty of other ways to get the optimizer to skip them for you in safe code - or at the very least, hoist them out of the loop instead. This seems like a good read showing some concrete examples:

https://coaxion.net/blog/2018/01/speeding-up-rgb-to-grayscal...

Explicitly using SIMD code like the second code snippet does might require unsafe for the transmuted slicing under the hood (implemented as raw pointer derefs in this case), but it's possible to build safe abstractions on top of that. That blog post links this, for example, although it didn't end up using it (EDIT: Actually, it initially didn't, but now does): https://github.com/AdamNiederer/faster . It does use unsafe internally: https://github.com/AdamNiederer/faster/search?q=unsafe&unsco...


I don't think a 25x speed difference can be attributed to safe code. In most cases safe code just means extra bounds checking or refcounting here and there, which isn't that bad even in hot loops.

The fragment of code in the article with `get_unchecked_mut()` shouldn't be necessary. It's a simple case that LLVM should be able to optimize. And if it doesn't, it can be helped by either iterating both slices with `zip()`, or the trick `let slice = &slice[0..len];`, which proves to LLVM that you have the required length so it doesn't need to be checked again.
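(A sketch of both safe alternatives on a hypothetical xor loop shaped like the article's add_assign; the function bodies here are illustrative, not the article's actual code:)

```rust
// Option 1: iterate the slices in lockstep; no indexing, so no bounds checks.
fn add_assign_zip(octets: &mut [u8], other: &[u8]) {
    for (dst, src) in octets.iter_mut().zip(other.iter()) {
        *dst ^= *src;
    }
}

// Option 2: re-slice to a common length up front, so LLVM can prove the
// indexed accesses below are in bounds and hoist or elide the checks.
fn add_assign_resliced(octets: &mut [u8], other: &[u8]) {
    let len = octets.len().min(other.len());
    let octets = &mut octets[..len];
    let other = &other[..len];
    for i in 0..len {
        octets[i] ^= other[i];
    }
}
```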

But overall the article seems in line with what you'd expect from Rust: you have low-level control over memory layout and safety checks, and you can make trade-offs to squeeze maximum performance out of an algorithm if you need to.


SIMD requires unsafe for some reason. Not sure exactly why.

I don't see why they used unsafe in `add_assign` - an assert!(octets.len() == other.len()) would likely have elided the bounds checks.
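(That suggestion as a minimal sketch, with the loop body made up for illustration:)

```rust
fn add_assign(octets: &mut [u8], other: &[u8]) {
    // Asserting equal lengths up front usually lets LLVM drop the
    // per-iteration bounds checks in the indexed loop below.
    assert_eq!(octets.len(), other.len());
    for i in 0..octets.len() {
        octets[i] ^= other[i];
    }
}
```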


I talk a little about why in my "fearless SIMD" blog post (also introducing a prototype that wraps the unsafety but at the moment is very limited in exactly what SIMD operations are exposed). https://raphlinus.github.io/rust/simd/2018/10/19/fearless-si...


https://users.rust-lang.org/t/how-to-zip-two-slices-efficien... explores eliding the bounds checks even without an assert! (though by using the smaller length as the bound, which might not be the desired behavior)


Anything outside of Rust proper (like assembly or SIMD intrinsics) can't be verified by Rust's built-in safety checks, just like interfacing with C libraries in Go isn't safe.

I would imagine that without these SIMD optimizations you would see at least a 2x slowdown.


I ran the benchmarks with SIMD disabled. Encoding goes from ~950Mbit/s down to ~350Mbit/s


What is the patent situation with Raptor codes nowadays?


We have a license and use them for satellite broadcasting, but my understanding is Qualcomm has made statements in https://datatracker.ietf.org/ipr/1511/ that if you use it for a "wireless wide-area standard (for example, a UMTS-compatible handset or Infrastructure equipment)" you'll be charged a standard royalty fee, otherwise they don't care.


Noob question: Any reason folks would use FountainCode/RaptorQ over TurboCode [0] / PolarCode [1]? What's the difference?

[0] https://en.wikipedia.org/wiki/Turbo_code

[1] https://en.wikipedia.org/wiki/Polar_code_(coding_theory)


I do not think the "non-fountain" styles of error correction have the nice property of "just grab any reasonable fraction of the packets and you will be able to decode the whole message". With a Turbo/LDPC/Polar/etc. code, you encode a packet, send it, and when it is received it is either decoded or not; there is no notion of a message spread redundantly over many packets where any n of them are enough to decode the full message.


... as long as you use it for implementing a specific IETF specification. At least that's how it reads to me?


Sure, I guess? I don't see why you'd want to deviate from RFC 6330 (https://tools.ietf.org/html/rfc6330) - stuff like this is complicated a.f.


Was just coming to ask the same question. Last I remember fountain codes were pretty locked down.


The inventor of Raptor codes, M. Amin Shokrollahi, sold his company, Digital Fountain, to Qualcomm, which acquired all of Digital Fountain's IP rights in the sale.

Qualcomm has asserted that these Raptor-code-related patents (an early one of which was filed in 2004) are standards-essential and are required to be licensed from Qualcomm.[1][2]

The below-linked patent would expire in 2024. However, there are a slew of continuation applications that expire much later than 2024.

Update: additionally, there is at least one earlier-dated patent filed in 1999, which expired in February. [3]

[1] https://datatracker.ietf.org/ipr/2554/

[2] https://patents.google.com/patent/US7139960

[3] https://patents.google.com/patent/US6307487





