
Fast Reed-Solomon Coding in Go - pandemicsyn
http://blog.klauspost.com/blazingly-fast-reed-solomon-coding/
======
Smerity
Whilst I quite enjoy Go, one annoyance I have is that there seems to be no
middle ground between alright performance using pure Go and great performance
using Go with assembler. Many other languages allow you, if you contort your
code, to get good performance without leaving the language. In this case, at
least, the SSE3 assembler for the Galois field arithmetic was well sourced
(and hopefully battle tested).

I hit this repeatedly when writing a pet project, govarint[1], which performs
integer encoding and decoding. I don't expect blazingly fast performance if I
don't use assembler, but bitwise operations are painfully slow compared to the
equivalent in C. C would also win handsomely even if the compilers were
equivalent, due to no boxing. Maintaining two divergent code paths is also
overhead: the Go code (the fallback in case there's no assembler for your
architecture) and the assembler code both need to be kept up to date and
checked for bugs / errors / inconsistencies.
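
To make the point concrete, here's a minimal LEB128-style varint
encoder/decoder (a generic sketch, not govarint's actual API); it's exactly
the kind of bitwise shifting and masking that ends up slow:

    package main

    import "fmt"

    // putUvarint appends x to buf, 7 bits per byte, low bits first.
    func putUvarint(buf []byte, x uint64) []byte {
        for x >= 0x80 {
            buf = append(buf, byte(x)|0x80) // set the continuation bit
            x >>= 7
        }
        return append(buf, byte(x))
    }

    // uvarint decodes one value and reports how many bytes it consumed.
    func uvarint(buf []byte) (uint64, int) {
        var x uint64
        var shift uint
        for i, b := range buf {
            if b < 0x80 { // final byte: continuation bit is clear
                return x | uint64(b)<<shift, i + 1
            }
            x |= uint64(b&0x7f) << shift
            shift += 7
        }
        return 0, 0 // truncated input
    }

    func main() {
        buf := putUvarint(nil, 300)
        v, n := uvarint(buf)
        fmt.Println(v, n) // prints: 300 2
    }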

Am I missing something? I only ever see assembler checked in to repositories -
is there a typical sane build process I've just not seen? Is it just a matter
of waiting for the Go compiler to "level up" its optimizations to match
GCC/Clang? Or should I just learn to live with the fact that to write Go I'll
also need to reacquaint myself with assembler?

[1]:
[https://github.com/Smerity/govarint](https://github.com/Smerity/govarint)

~~~
pcwalton
There's no free lunch. One of Golang's primary goals is to compile fast (which
is admirable!), and it achieves that in part by leaving out lots of
optimizations. Bringing 6g/8g up to the same level of optimization as
LLVM/GCC will, in all likelihood, slow down its compilation. I can't speak
for where along the line of speed versus code quality Go's developers want to
place the compiler in the long run, nor am I saying you can't do better than
both GCC and LLVM in compilation speed. But I do think there's a fundamental
tradeoff there that no project can eliminate, short of writing in assembler
(which, for all its downsides, is an approach that does sidestep the
tradeoff).

~~~
XorNot
...I don't see it.

Compile-time optimization is something you can turn on for your production
build only. Unless Go's design implicitly inhibits certain types of
optimization, but that would seem odd given how opinionated the language is.
What you do in Go, and what you mean, are much clearer than in most other
languages.

~~~
kibwen
The go tool has no distinction between development builds and production
builds; every build is a production build. This itself is a manifestation of
Go's opinionated philosophy.

[https://golang.org/cmd/go/#hdr-Compile_packages_and_dependen...](https://golang.org/cmd/go/#hdr-Compile_packages_and_dependencies)

------
zaroth
Very nice work, and thanks for releasing under MIT.

Interesting that Backblaze decided to go with a 17+3 configuration. As I read
more on the topic, I realized I haven't quite wrapped my head around how data
and parity are allocated across symbols, stripes, and disks, and how all that
impacts degraded read performance, but at first glance I would assume 17+3
would have pretty terrible degraded read performance.
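
For reference, setting up that layout with the library from the post looks
roughly like this (a sketch based on the post's description of the API; exact
details may differ):

    package main

    import "github.com/klauspost/reedsolomon"

    func main() {
        // 17 data shards + 3 parity shards: any 17 of the 20 suffice.
        enc, err := reedsolomon.New(17, 3)
        if err != nil {
            panic(err)
        }

        data := make([]byte, 17*1024) // some input to protect

        // Split slices (and pads) the input into 17 equal data shards,
        // plus empty parity shards for Encode to fill in.
        shards, err := enc.Split(data)
        if err != nil {
            panic(err)
        }

        // Encode computes the parity shards in place.
        if err := enc.Encode(shards); err != nil {
            panic(err)
        }

        // Verify reports whether parity is consistent with the data.
        ok, _ := enc.Verify(shards)
        _ = ok
    }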

Another great read if you're interested in homebrew RAID or parity schemes is
Plank's paper "Tutorial on Reed-Solomon Coding for Fault-Tolerance in
RAID-like Systems". [1]

I found that gem reading Adam's blog on implementing triple parity RAID in
ZFS. [2]

[1] -
[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.103...](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.103.7089&rep=rep1&type=pdf)

[2] -
[https://blogs.oracle.com/ahl/entry/triple_parity_raid_z](https://blogs.oracle.com/ahl/entry/triple_parity_raid_z)

~~~
klauspost
_Very nice work, and thanks for releasing under MIT._

Cheers. The hard work was done by BB in their initial library, though :)

 _Interesting that Backblaze decided to go with 17+3 configuration._

Yeah. It seems they don't go for any RAID within a storage pod, but instead
use raw ext4 and distribute shards across pods, rather than across the hard
drives in a single pod. [1]

This of course means that a single pod isn't loaded down when a drive is
replaced; the rebuild load is distributed across the 17 other pods instead.

[1] [https://www.backblaze.com/blog/vault-cloud-storage-architect...](https://www.backblaze.com/blog/vault-cloud-storage-architecture/)

~~~
haikubrian
(BrianB from Backblaze here)

Yes, that's right. When a drive is replaced, its contents are rebuilt from the
data on the 19 corresponding drives in the other 19 pods in the vault.
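
In terms of the Go library's API, such a rebuild looks something like this
sketch (illustrative only, not the actual rebuild code):

    package main

    import "github.com/klauspost/reedsolomon"

    func main() {
        enc, _ := reedsolomon.New(17, 3)
        shards, _ := enc.Split(make([]byte, 17*64))
        _ = enc.Encode(shards)

        // Simulate a failed drive: its shard is simply gone.
        shards[4] = nil

        // Reconstruct recomputes missing shards from any 17 survivors;
        // with more than 3 shards missing it returns an error instead.
        if err := enc.Reconstruct(shards); err != nil {
            panic(err)
        }
    }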

------
signa11
the article links to an excellent introduction to Galois field arithmetic:
[http://www.snia.org/sites/default/files2/SDC2013/presentatio...](http://www.snia.org/sites/default/files2/SDC2013/presentations/NewThinking/EthanMiller_Screaming_Fast_Galois_Field%20Arithmetic_SIMD%20Instructions.pdf).

there is also a linux-kernel article by hpa:
[https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf](https://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf)
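
for the impatient, the core operation those slides build up to, multiplication
in GF(2^8), fits in a few lines. a table-free sketch using the x^8 + x^4 + x^3
+ x^2 + 1 (0x11d) polynomial from hpa's paper (real implementations use lookup
tables or SIMD shuffles instead):

    package main

    import "fmt"

    // gfMul multiplies two elements of GF(2^8) by shift-and-XOR
    // ("Russian peasant" multiplication), reducing modulo the
    // irreducible polynomial 0x11d whenever a term overflows 8 bits.
    func gfMul(a, b byte) byte {
        var p byte
        for b != 0 {
            if b&1 != 0 {
                p ^= a // "addition" is XOR in characteristic 2
            }
            carry := a & 0x80
            a <<= 1
            if carry != 0 {
                a ^= 0x1d // 0x11d with the x^8 bit already shifted out
            }
            b >>= 1
        }
        return p
    }

    func main() {
        // x^7 * x = x^8, which reduces to x^4+x^3+x^2+1 = 0x1d.
        fmt.Printf("%#x\n", gfMul(0x80, 0x02))
    }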

~~~
zaroth
Great links, thanks!

At the end of the presentation, they mention rotated Reed-Solomon array
codes, which led me to another paper Plank worked on, "Rethinking Erasure
Codes for Cloud File Systems: Minimizing I/O for Recovery and Degraded
Reads", which was very interesting as well. It directly addresses the problem
of read performance in a degraded state.

[http://www.cs.jhu.edu/~okhan/fast12.pdf](http://www.cs.jhu.edu/~okhan/fast12.pdf)

~~~
zzzcpan
I wonder why Rabin's IDA ("Efficient Dispersal of Information for Security,
Load Balancing, and Fault Tolerance") isn't mentioned anywhere.

[http://web.archive.org/web/20070221140752/http://discovery.c...](http://web.archive.org/web/20070221140752/http://discovery.csc.ncsu.edu/~aliu3/reading_group/p335-rabin.pdf)

------
creshal
I was under the impression that R-S has been replaced by turbo codes in most
use cases. Why not use those here? Patent issues?

~~~
DanWaterworth
I believe the difference is that in this case RS is being used as an erasure
code: you either have a piece of the data or you don't, because the drive has
either failed or it hasn't. Failures are (approximately) uncorrelated
(assuming different parts are stored on different hosts), so for this use
case RS approaches optimal in terms of storage used (some metadata needs to
be stored, which is overhead).
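
To put a number on "approaches optimal": a Reed-Solomon code with k data and m
parity shards stores (k+m)/k bytes per byte of input and survives any m
erasures, which meets the Singleton bound. For Backblaze's 17+3 that's 20/17,
about 17.6% overhead to tolerate any 3 simultaneous failures; plain
replication with the same failure tolerance would need 4 copies, i.e. 300%
overhead.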

------
rakoo
Quick question about the API: why separate the Split phase and the Encode
phase?

In other words, are there any scenarios where one would want to Split data but
not Encode it, or where data is automatically split and all that is left to do
is to Encode?
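
For context, a sketch of my understanding (assuming the API from the post):
Encode operates on any set of equally sized shards, so data that already
arrives chunked never needs Split at all.

    package main

    import "github.com/klauspost/reedsolomon"

    func main() {
        enc, _ := reedsolomon.New(4, 2)

        // 4 data + 2 parity shards, all the same size.
        shards := make([][]byte, 6)
        for i := range shards {
            shards[i] = make([]byte, 1024)
        }
        // ... fill shards[0..3] from your own chunking/framing ...

        // Encode only computes shards[4] and shards[5]; Split is just
        // a convenience for data that starts as one []byte.
        _ = enc.Encode(shards)
    }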

------
0xdeadbeefbabe
On a somewhat related note, what is the meaning of 6a and 6c? I suppose 6g is
the Go compiler and 6l is the linker. Is 6a an assembler and 6c a C compiler?

------
atYevP
Yev from Backblaze here -> Nice work!

