RaptorQ and performance optimization in Rust (cberner.com)
127 points by raccoonone on March 30, 2019 | 48 comments



When comparing a Rust rewrite of a C++ codebase against the original, I noticed that Rust's default HashMap and HashSet were noticeably slower than unordered_map and unordered_set. The reason is that Rust uses a hash function with better properties (it's DoS-resistant), but that is a bit slower. If a HashMap is a performance bottleneck, you can swap in a simpler hash function to get the same performance as in C++. With FnvHashMap, the performance matched unordered_map.
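(For concreteness, here is a minimal sketch of that swap, assuming the `fnv` crate; the keys are made up for illustration:)

```rust
use fnv::FnvHashMap; // from the `fnv` crate
use std::collections::HashMap;

fn main() {
    // Default HashMap: SipHash-based, DoS-resistant, but slower per hash.
    let mut default_map: HashMap<u32, u64> = HashMap::new();
    default_map.insert(1, 10);

    // Drop-in replacement using the much cheaper FNV hash; fine when the
    // keys are small and not attacker-controlled.
    let mut fnv_map: FnvHashMap<u32, u64> = FnvHashMap::default();
    fnv_map.insert(1, 10);

    assert_eq!(default_map[&1], fnv_map[&1]);
}
```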



Hashbrown will soon be the standard hashmap!


Does it still provide the DoS protection SipHash offered, or does that mean that Rust is moving from secure-by-default to fastest-by-default when it comes to HashMap? If so, it's kind of a big philosophical switch.


Hashbrown is not about the hashing algorithm. It uses FxHash, which is pretty commonly used in Rust. I imagine they will still keep SipHash.


OK, thanks. But that would mean the performance gain would not be as dramatic, I guess?


It's still pretty major, because SIMD, but not as dramatic

https://github.com/Amanieu/hashbrown#performance

Note that the hashbrown crate uses FxHash by default, so the second table is the fair comparison. When ported to the stdlib it will likely be changed to have SipHash as the default.

For the standard HashMap, replacing SipHash with FxHash gets you a 4x perf boost. Replacing the standard HashMap with hashbrown gets you a 2x perf boost (when you're using FxHash; there are no numbers here for SipHash, but I suspect the gain will still be significant, if smaller).
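(A sketch of what those combinations look like in code, assuming hashbrown's std-compatible API and its then-default Fx-style hasher; `RandomState` is std's SipHash-based hash builder, plugged in to keep DoS resistance while still getting the SIMD table layout:)

```rust
use std::collections::hash_map::RandomState;

// hashbrown with its default (Fx-style) hasher: fastest.
type FastMap<K, V> = hashbrown::HashMap<K, V>;

// hashbrown with std's SipHash-based RandomState: keeps DoS resistance,
// still benefits from the SIMD probing and better memory layout.
type SipMap<K, V> = hashbrown::HashMap<K, V, RandomState>;

fn main() {
    let mut fast: FastMap<u32, &str> = FastMap::default();
    fast.insert(1, "one");

    let mut sip: SipMap<u32, &str> = SipMap::with_hasher(RandomState::new());
    sip.insert(2, "two");

    assert_eq!(fast[&1], "one");
    assert_eq!(sip[&2], "two");
}
```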


Thanks a lot for the clarification!


Thanks! I just replaced FnvHashMap with hashbrown::HashMap and got a 20% overall performance improvement in my workload (which is rather HashMap-heavy).


Slightly tangential, but C++ hash tables aren't even the fastest. The standard's requirements on iterator invalidation force the use of extra allocations and indirections for separate chaining.

I've hand-rolled and/or reused hash tables that use various probing methods (removing both allocations and extra indirections) and the performance improvement is non-trivial (for my workloads anyhow).


I've been surprised by how slow the C++ standard data structures are (likely due to the need to support a lot of functionality). I once wrote a basic heap in C and benchmarked it at -O3 against the C++ STL heap. I was expecting mine to be slower than the highly optimized STL heap, and was just wondering how much slower. Instead, it turned out to be 1.5-2x as fast (and this was despite freeing memory when the underlying vector shrank - something the C++ vector does not do).


I have a C++ implementation of RaptorQ.

Never went for performance, but at most I use `std::map<int, pointer>` (which is an RB-tree).

You can implement RQ without a hashmap.

It is only used to track the symbol number (uint32), so introducing hashing sounds a bit wasteful. You could also use a plain vector, actually, but since the symbol numbers come from the network, that might result in a lot of reordering or pointless allocation if you are not really careful.

--edit: typos
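(A rough sketch of that symbol-tracking approach in Rust terms, with hypothetical names; `BTreeMap` is the ordered-map analogue of `std::map`:)

```rust
use std::collections::BTreeMap;

// Hypothetical store for received symbols, keyed by symbol number.
// An ordered map avoids hashing a plain u32 and tolerates out-of-order
// arrival, unlike a Vec indexed by symbol number, which would need
// careful pre-allocation when symbols arrive in arbitrary order.
struct ReceivedSymbols {
    symbols: BTreeMap<u32, Vec<u8>>,
}

impl ReceivedSymbols {
    fn new() -> Self {
        ReceivedSymbols { symbols: BTreeMap::new() }
    }

    fn insert(&mut self, symbol_id: u32, data: Vec<u8>) {
        self.symbols.insert(symbol_id, data);
    }

    fn ready_to_decode(&self, symbols_needed: usize) -> bool {
        self.symbols.len() >= symbols_needed
    }
}
```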


I have played with a C++ implementation of RFC6330, and while I have to say I never went for performance, I find this benchmark a bit... pointless.

RaptorQ is essentially a Gaussian elimination of a matrix, so it all depends on the block size (=> matrix size). The algorithm has basically cubic complexity in the number of symbols in a block. RFC 6330 is made to work on files, which are divided into blocks with a certain number of symbols, and the bytes are interleaved.

This implementation does not do the (complex and almost pointless) interleaving, which is fine, even OpenRQ does not.

The bench seems to be done on a... 10KB file? It all fits in the L2 cache. We are not given the symbol size (which determines the block size!), and I assume all of this fits in a 10x10 matrix.

You are benchmarking operations on a matrix that is (more or less) a 10x10 byte matrix.

The biggest part of this benchmark might almost be the generation of repair symbols (was it even done?), since that would require multiple xoring of the above-mentioned symbols.
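(For readers unfamiliar with the octet math being benchmarked: in GF(256), addition is XOR and multiplication is carry-less multiplication reduced modulo an irreducible degree-8 polynomial - here x^8 + x^4 + x^3 + x^2 + 1, which I believe is the one RFC 6330 specifies. A minimal shift-and-add sketch, not the table-driven version real implementations use:)

```rust
/// GF(256) addition: plain XOR (each octet is a polynomial over GF(2)).
fn gf_add(a: u8, b: u8) -> u8 {
    a ^ b
}

/// GF(256) multiplication via shift-and-add, reducing modulo the degree-8
/// polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D). Real implementations use
/// precomputed log/exp tables instead of this loop.
fn gf_mul(mut a: u8, mut b: u8) -> u8 {
    let mut product = 0u8;
    for _ in 0..8 {
        if b & 1 != 0 {
            product ^= a;
        }
        let overflow = a & 0x80 != 0;
        a <<= 1;
        if overflow {
            a ^= 0x1D; // x^8 ≡ x^4 + x^3 + x^2 + 1 (low byte of 0x11D)
        }
        b >>= 1;
    }
    product
}

fn main() {
    assert_eq!(gf_add(2, 3), 1); // addition is just XOR
    assert_eq!(gf_mul(2, 3), 6);
}
```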

This is much closer to micro-benchmarking than an actual benchmark, imho. It would have been more interesting to see what happens with files at least larger than the L3 cache.

You can also cache intermediate results (which he does not do), which is especially useful for encoding, but only when working on matrices >= 100x100; otherwise, just searching the cache, fetching from memory (my implementation optionally did LZ4 compression/decompression), and doing a matrix multiplication is slower than just recomputing the matrix.

Still, it's nice to see implementations of the RFC, which is a real pain to read...


Does anyone know of a good (preferably pure) Go implementation? https://github.com/google/gofountain has made a good start and implemented the older plain Raptor standard, but I haven't been able to find a good Go library.

If anyone is qualified enough to take a crack at it, I can try to arrange sponsorship as well.


RaptorQ, even in this implementation, appears to be a lot slower than wirehair: https://github.com/catid/wirehair


Thanks for the pointer; any other similar algorithms?

O(n) sounds impressive, although I've heard such claims about RQ too...

RQ complexity is O(n^3), but in the size of the internal matrix, not the input. It all depends on the block size you choose.

The numbers are impressive, too. Encoding 40 MB in 0 usec? I'll have to check the theory behind that.


You could look at LDPC triangle/staircase codes. They're fast even in a fairly naive implementation, but they have fairly high overhead unless most of your input is the systematic part of the codeword.


For anyone interested in how this code works, here is Qualcomm's technical overview of RaptorQ: https://www.qualcomm.com/media/documents/files/raptorq-techn...


I don't know much about SIMD, but are there no alignment concerns with reinterpreting a slice as an AVX vector?


Unaligned and aligned loads of AVX vectors have basically the same performance since Ivy Bridge IIRC.
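(Concretely, the intrinsics typically used on byte slices are the unaligned-load variants, so no alignment guarantee on the slice is required; a sketch of a 32-byte XOR, assuming x86_64 with AVX2 and slices of at least 32 bytes:)

```rust
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn xor_32_bytes(dst: &mut [u8], src: &[u8]) {
    use std::arch::x86_64::*;
    debug_assert!(dst.len() >= 32 && src.len() >= 32);
    // loadu/storeu are the *unaligned* load/store intrinsics, so the slices
    // can start at any address; the aligned variants are load/store.
    let a = _mm256_loadu_si256(dst.as_ptr() as *const __m256i);
    let b = _mm256_loadu_si256(src.as_ptr() as *const __m256i);
    _mm256_storeu_si256(dst.as_mut_ptr() as *mut __m256i, _mm256_xor_si256(a, b));
}
```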


I was under the impression that unaligned ops ran at the same speed, but they used up more register ports, so it does reduce memory bandwidth between the register file and cache. Or does this no longer apply either?


My understanding is that the first unaligned load uses more register ports[0], but a second (third, etc.) contiguous load doesn't. IANA[Intel microarchitecture expert], though.

0: Or more memory bandwidth anyway.


Very cool efforts here. Curious what the speed is without those instructions present on the CPU, and what hardware this was run on.

The concept of a fountain code seems interesting, but what is a good use case? Variable-strength error correction?


Let's say you want to send something (say, 100KB) to a million listeners, but you don't know which of them will be listening when. You'd feed that 100KB into a RaptorQ encoder, configure it for 1KB packets, and it would give you a near-infinite stream of 1KB packets that you could broadcast (usually over satellite, but UDP, multicast, and QR codes all work).

Receivers would listen for as many of these packets as they can, as and when they can (the transmission can be "lossy"), and if a receiver picks up 100 unique packets (any 100 unique packets, in any order), it'll have a 99.9% (or something along those lines) chance of decoding the message successfully. Each extra unique packet adds a 9 to the chances.

That's why it's called a "fountain": it's a data fountain, and you can grab any quantity of data off it at any point and still have a good chance of reconstructing the message.
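(Roughly what that looks like with the raptorq crate the article is about; I'm going from memory of its Encoder/Decoder API, so treat the exact signatures as approximate:)

```rust
use raptorq::{Decoder, Encoder, EncodingPacket};

fn main() {
    // Stand-in for the 100KB message above.
    let data: Vec<u8> = (0..100 * 1024).map(|i| (i % 256) as u8).collect();

    // Split into ~1KB packets and generate some extra repair packets.
    let encoder = Encoder::with_defaults(&data, 1024);
    let mut packets: Vec<EncodingPacket> = encoder.get_encoded_packets(20);

    // Simulate a lossy, reordered broadcast: drop some packets, reverse the rest.
    packets.truncate(packets.len() - 10);
    packets.reverse();

    // Any sufficiently large subset, in any order, should be enough to decode.
    let mut decoder = Decoder::new(encoder.get_config());
    let mut result = None;
    for packet in packets {
        result = decoder.decode(packet);
        if result.is_some() {
            break;
        }
    }
    assert_eq!(result.unwrap(), data);
}
```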


Very interesting. Still struggling to see how this would be more efficient than just looping that data, for the use case of getting beacon data out. Perhaps it's more consistent/predictable in scenarios of high or targeted loss (i.e. the first X bytes keep getting dropped)?


If you're transferring 5GB in 1KB packets over UDP or a satellite link, the benefits are very clear. Without a fountain code, there are 5,000,000 packets you need to receive. Miss one packet, and you have to sit through another 4,999,999 packets and pray very hard that you get your packet reliably the next time around.

With a fountain code, it's not a problem. If you miss one packet, you have a 99% chance of making do with the 5,000,001st packet that comes, a 99.9% chance with the 5,000,002nd packet, etc.

In other words, without a fountain code, if you want to reliably receive 1GB, and you miss a packet, your satellite will have to transmit 2GB just for your sake, which is expensive af. With a fountain code, it’ll only need to send 1.01GB.


It vastly reduces the chance of you receiving a packet that contains no new information. If you just rebroadcast the same packets over and over again, a receiver that has 999/1000 packets may have a very low chance of serendipitously receiving the packet they need.


Not an expert, but if you miss a packet with the looping approach you have to wait for that particular packet to come around again, whereas with the fountain it sounds like you can recover it sooner.


Multicast reliable transmission that scales well to many many clients.


Can anyone suggest a good book that would serve as an introduction to finite field arithmetic? I keep randomly running into it (e.g. this post), but don't understand it well enough to follow the discussion.


If you would prefer video, then UoCambridge's recording of "Lecture 5: Entropy and Data Compression (IV): Shannon's Source Coding Theorem, Symbol Codes and Arithmetic Coding" is up here: http://videolectures.net/mackay_course_05/

In case you don't find what you're looking for, here are all lectures on "Information Theory, Pattern Recognition, and Neural Networks": http://videolectures.net/course_information_theory_pattern_r...

I'd be remiss not to mention that the lectures are by the late David MacKay: https://news.ycombinator.com/item?id=11500221


Huffman: Fundamentals of Error-Correcting Codes

ISBN 9780521782807


Anybody want to comment on whether all of his unsafe code was actually necessary? Seems bad for Rust that safe code is 25x slower.


> Seems bad for Rust that safe code is 25x slower.

Most of the optimizations listed have nothing to do with unsafe; they are algorithmic changes that you could apply in any language, and to safe code just as well as to unsafe code. Even if we pretend all the unsafe used there is necessary, and interpret the post in the worst possible light for Rust - that the optimization steps would be impossible without unsafe - I think we'd only be talking 4x to 5x slower based on this article, not 25x.

But - I strongly suspect not all that unsafe is necessary, and the post certainly doesn't claim it is.

The first code snippet uses unsafe to skip bounds checks. There are plenty of other ways to get the optimizer to skip them for you in safe code - or at the very least, hoist them out of the loop instead. This seems like a good read showing some concrete examples:

https://coaxion.net/blog/2018/01/speeding-up-rgb-to-grayscal...

Explicitly using SIMD code like the second code snippet does might require unsafe for the transmuted slicing under the hood (implemented as raw pointer derefs in this case), but it's possible to build safe abstractions on top of that. That blog post links this, for example, although it didn't end up using it (EDIT: Actually, it initially didn't, but now does): https://github.com/AdamNiederer/faster . It does use unsafe internally: https://github.com/AdamNiederer/faster/search?q=unsafe&unsco...


I don't think a 25x speed difference can be attributed to safe code. In most cases safe code just means extra bounds checking or refcounting here and there, which isn't that bad even in hot loops.

The fragment of code in the article with `get_unchecked_mut()` shouldn't be necessary. It's a simple case that LLVM should be able to optimize. And if it doesn't, it can be helped by either iterating both slices with `zip()`, or the trick `let slice = &slice[0..len];`, which proves to LLVM that you have the required length so it doesn't need to be checked again.
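(A sketch of both safe alternatives on a hypothetical xor loop shaped like the article's add_assign; the function bodies here are illustrative, not the article's actual code:)

```rust
// Option 1: iterate the slices in lockstep; no indexing, so no bounds checks.
fn add_assign_zip(octets: &mut [u8], other: &[u8]) {
    for (dst, src) in octets.iter_mut().zip(other.iter()) {
        *dst ^= *src;
    }
}

// Option 2: re-slice to a common length up front, so LLVM can prove the
// indexed accesses below are in bounds and hoist or elide the checks.
fn add_assign_resliced(octets: &mut [u8], other: &[u8]) {
    let len = octets.len().min(other.len());
    let octets = &mut octets[..len];
    let other = &other[..len];
    for i in 0..len {
        octets[i] ^= other[i];
    }
}
```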

But overall the article seems in line with what you'd expect from Rust: you have low-level control over memory layout and safety checks, and you can make trade-offs to squeeze maximum performance out of an algorithm if you need to.


SIMD requires unsafe for some reason. Not sure exactly why.

I don't see why they used unsafe in `add_assign` - an assert!(octets.len() == other.len()) would likely have elided the bounds checks.
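(That suggestion as a minimal sketch, with the loop body made up for illustration:)

```rust
fn add_assign(octets: &mut [u8], other: &[u8]) {
    // Asserting equal lengths up front usually lets LLVM drop the
    // per-iteration bounds checks in the indexed loop below.
    assert_eq!(octets.len(), other.len());
    for i in 0..octets.len() {
        octets[i] ^= other[i];
    }
}
```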


I talk a little about why in my "fearless SIMD" blog post (also introducing a prototype that wraps the unsafety but at the moment is very limited in exactly what SIMD operations are exposed). https://raphlinus.github.io/rust/simd/2018/10/19/fearless-si...


https://users.rust-lang.org/t/how-to-zip-two-slices-efficien... explores eliding the bounds checks even without an assert! (though by using the smaller length as the bound, which might not be the desired behavior)


Anything outside of Rust proper (like assembly or SIMD intrinsics) can't be verified by Rust's built-in safety checks, just like interfacing with C libraries in Go isn't safe.

I would imagine that without these SIMD optimizations you would see at least a 2x slowdown.


I ran the benchmarks with SIMD disabled. Encoding goes from ~950Mbit/s down to ~350Mbit/s


What is the patent situation with Raptor codes nowadays?


We have a license and use them for satellite broadcasting, but my understanding is Qualcomm has made statements in https://datatracker.ietf.org/ipr/1511/ that if you use it for a "wireless wide-area standard (for example, a UMTS-compatible handset or Infrastructure equipment)" you'll be charged a standard royalty fee, otherwise they don't care.


Noob question: Any reason folks would use FountainCode/RaptorQ over TurboCode [0] / PolarCode [1]? What's the difference?

[0] https://en.wikipedia.org/wiki/Turbo_code

[1] https://en.wikipedia.org/wiki/Polar_code_(coding_theory)


I do not think the "non-fountain" styles of error correction have the nice property of "just grab any reasonable fraction of the packets and you will be able to decode the whole message". With a Turbo/LDPC/Polar/etc. code, you encode a packet, send it, and when it is received it is either decoded or not; there is no notion of a message spread redundantly over many packets where any n of them are enough to decode the full message.


... as long as you use it for implementing a specific IETF specification. At least that's how it reads to me?


Sure, I guess? I don't see why you'd want to deviate from RFC 6330 (https://tools.ietf.org/html/rfc6330) - stuff like this is complicated a.f.


Was just coming to ask the same question. Last I remember fountain codes were pretty locked down.


The inventor of Raptor codes, M. Amin Shokrollahi, sold his company, Digital Fountain, to Qualcomm, which acquired all of Digital Fountain's IP rights in the sale.

Qualcomm has asserted that these Raptor-code-related patents (an early one of which was filed in 2004) are standards-essential and are required to be licensed from Qualcomm.[1][2]

The below-linked patent would expire in 2024. However, there are a slew of continuation applications that expire much later than 2024.

Update: additionally, there is at least one earlier-dated patent filed in 1999, which expired in February. [3]

[1] https://datatracker.ietf.org/ipr/2554/

[2] https://patents.google.com/patent/US7139960

[3] https://patents.google.com/patent/US6307487





