
Reed-Solomon coder computing one million ECC blocks at 1 GB/s - 112233
https://github.com/Bulat-Ziganshin/FastECC
======
nemo1618
We've been using klauspost's Golang reedsolomon package in our distributed
storage package. It runs approximately as fast (~1 GB/s) because the hot paths
are written in asm:
[https://github.com/klauspost/reedsolomon](https://github.com/klauspost/reedsolomon)

Erasure codes are pretty magical.

~~~
zokier
I'm not sure if they are quite in the same class. Or maybe I'm
misunderstanding, but the Go implementation says "Note that the maximum number
of data shards is _256_ " while FastECC says "Current implementation is
limited to _2^20_ blocks, removing this limit is the main priority [...]". Or
are shards and blocks different things here?

~~~
acrefoot
They are pretty different.

Blocks are like blocks in a filesystem or on a disk, while shards are like
RAID stripes/mirrors. 2^20 blocks would make for a pretty small datastore,
depending on what your needs are, and seems a more serious limitation than the
256 shards.

Not an exact description, but a hopefully workable analogy.

~~~
acrefoot
Actually, it's not clear to me if FastECC is saying that it works on 1M blocks
together as a unit, in which case they would be the same as the shards. If
that's the case, it does make this a totally different class of thing--what
are the use cases?

~~~
sliken
I think it's mostly simplicity. Say your ECC code is only fast with, say, 17
shards/blocks of data and 3 shards/blocks of parity. I hear numbers like that
used for Hadoop, Backblaze, Microsoft and Facebook object storage, etc.

So sure, you can split large files into groups of 20 pieces (17 data + 3
parity). But while the real world often has low average error rates, the errors
that do happen can come in bursts: adjacent nodes dying, power strips/circuits
blowing, complete loss of network connectivity for 30 seconds, etc. BTW, yes,
I'm aware you can tell Hadoop to keep 2 copies in the rack and 1 copy outside
the rack; that's basically crude interleaving.

If your code can handle a larger number of blocks/shards, you can spread your
parity over more of them, plan for the average error rate, and still handle
substantial bursts of errors.

So maybe instead of 17+3 (~17% parity overhead, survives any 3 lost) you switch
to 2048+192 (~9.4% parity). That way you can survive up to a ~9.4% loss rate,
even if 4 or more of those losses come in a row, which a 17+3 group couldn't
handle. Granted, updates become much more expensive, but in some cases that's
well justified by the extra durability.
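
A quick back-of-the-envelope check of those figures (a rough Python sketch;
"overhead" here is parity counted against data, so the percentages shift a bit
if you count against the total instead):

    for data, parity in [(17, 3), (2048, 192)]:
        # parity as a fraction of data; divide by (data + parity) instead to
        # express it as a fraction of the total stored blocks
        print(f"{data}+{parity}: {parity / data:.1%} parity overhead, "
              f"survives any {parity} lost blocks, bursts included")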

~~~
Bulat-Ziganshin
if they will need that, they will find a method long ago. but they try to
minimize amount of data read/written for every transaction. look up for
example "pyramid codes" \- it's a way to use 20 disks in raid system but
read/write only 5-6 for any operation. and MS use that in azure afaik

with 2000 data blocks you will need to read back all those datya blocks to
generate new parity when any of these data blocks was changed. and this
decreases overall perfromance 2000 times. so you should see why they need
those pyrammid codes rather than something on opposite side

------
gbrown_
Firstly this looks cool and I'll be looking at it closer when I get home from
work.

> Note that all speeds mentioned here are measured on i7-4770, employing all
> features available in a particular program - including multi-threading, SIMD
> and x64 support.

I hope this doesn't wind up infringing on the StreamScale patent[1], as some
other works in this field have. Again, I need to give this a proper read.

[1]
[https://www.google.com/patents/US8683296](https://www.google.com/patents/US8683296)

~~~
Bulat-Ziganshin
The math behind it is completely different. It doesn't even use GF(2^n) fields
at all :)

As mentioned in RS.md, the encoding scheme has been known since 2003 and even
became part of an RFC.

------
planteen
I've implemented a Reed-Solomon encoder for telemetry blocks over a RF
channel. There, I was using a (255, 223) code on GF(2^8). There were
interleaving parameters for a telemetry "packet" but each encoding took
constant O(1) time. So the overall time to encode a downlink was just O(N).
What exactly is meant by "Reed-Solomon ECC"? I'm guessing the code is doing
something different related to data storage than what a traditional
communications RS code does?

~~~
Bulat-Ziganshin
Let's see: in order to produce 32 parity blocks from 223 data blocks, you
probably perform 32x223 multiplications, i.e. NxM operations, where N and M are
the numbers of data and parity blocks. For a fixed N/M ratio that is considered
an O(N^2) algorithm.

Once you fix N and M, the encoding speed is fixed in both your algo and mine.
In mine, it's 1 GB/s for N=M=524288.

And yes, it's doing the same thing as your algo - generating parity blocks from
data blocks, then restoring from any combination of N surviving blocks.
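
To make the NxM counting concrete, here is a rough sketch of that quadratic
approach in Python (illustration only, not FastECC's actual code - FastECC
works in a prime field, and the coefficient matrix below is a plain Vandermonde
picked purely for illustration, where production codecs use a Cauchy or
systematized matrix):

    # GF(2^8) exp/log tables (0x11d polynomial), the usual table-lookup trick
    GF_EXP = [0] * 512
    GF_LOG = [0] * 256
    x = 1
    for i in range(255):
        GF_EXP[i] = x
        GF_LOG[x] = i
        x <<= 1
        if x & 0x100:
            x ^= 0x11d
    for i in range(255, 512):
        GF_EXP[i] = GF_EXP[i - 255]

    def gf_mul(a, b):
        if a == 0 or b == 0:
            return 0
        return GF_EXP[GF_LOG[a] + GF_LOG[b]]

    def encode_words(data_words, num_parity):
        """N data words in, M parity words out: N*M multiplications."""
        parity = [0] * num_parity
        for m in range(num_parity):
            for n, d in enumerate(data_words):
                coeff = GF_EXP[(m * n) % 255]   # Vandermonde entry alpha^(m*n)
                parity[m] ^= gf_mul(coeff, d)   # addition in GF(2^8) is XOR
            # one multiply-accumulate per (data word, parity word) pair
        return parity

    print(encode_words([0x12, 0x34, 0x56, 0x78], 2))

The transform-based encoding in FastECC replaces that inner double loop, which
is where the asymptotic speedup over O(N^2) comes from.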

------
sliken
Has anyone compared this to the Intel ISA-L library, which uses various
Intel-specific instructions to accelerate Reed-Solomon encoding?

I believe this is what Facebook, Microsoft, and similar organizations are using
to avoid Hadoop's 3x replication overhead while still keeping the same degree
of protection.

[https://github.com/01org/isa-l](https://github.com/01org/isa-l)

~~~
Bulat-Ziganshin
I (the FastECC author) don't have ISA-L benchmarks at hand, but I compared
FastECC to the best O(N^2) implementation I know of - MultiPar. My conclusion
is that MultiPar becomes slower than FastECC starting from ~32 ECC blocks.

------
Chris2048
Are there any good explanations of Reed-Solomon? The mathematics look a bit
hairy...

~~~
hcs
The Wikipedia article has a section "Reed & Solomon's original view: The
codeword as a sequence of values" which I think is fairly easy to follow.

The fundamental idea is that a polynomial of degree n is uniquely described by
n+1 points (2 points give a unique line, 3 points a unique parabola, etc). If the
polynomial is the data you want to protect, you can pick enough points to
describe it redundantly. Then, if a point is lost, as long as there are enough
to still uniquely determine the original polynomial, you can recover the
original data. And if the points do not all come from the same polynomial (as
when the data has been corrupted) you can at least detect this, and you may be
able to throw out some of the points to recover.

(Edited to correct that the polynomial is the original message, not some set
of points)
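
If a concrete toy helps, here is that picture in a few lines of Python over the
small prime field GF(257) (chosen only so the arithmetic is exact; real
implementations organize the data differently):

    P = 257  # a small prime, so all arithmetic below is exact

    def lagrange_eval(points, x):
        """Evaluate at x the unique polynomial through `points` ((xi, yi) pairs), mod P."""
        total = 0
        for i, (xi, yi) in enumerate(points):
            num, den = 1, 1
            for j, (xj, _) in enumerate(points):
                if i != j:
                    num = num * (x - xj) % P
                    den = den * (xi - xj) % P
            total = (total + yi * num * pow(den, -1, P)) % P  # pow(den, -1, P): modular inverse
        return total

    data = [42, 7, 99, 13]                 # k = 4 data symbols, placed at x = 0..3
    k = len(data)
    points = list(enumerate(data))
    parity = [(x, lagrange_eval(points, x)) for x in range(k, k + 2)]  # 2 redundant points

    # lose ANY two of the six points; any 4 survivors pin down the degree-3 polynomial
    survivors = [points[0], points[2], parity[0], parity[1]]
    recovered = [lagrange_eval(survivors, x) for x in range(k)]
    assert recovered == data

Any 4 of the 6 points would have worked in the last step, which is the "lose
any of them" property.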

~~~
Chris2048
Just to interrogate this description:

So, there is a trick to picking the points such that _any_ can be lost, and
only the remaining number matters? I think this was the case with par2, which
used RS - you could lose _any_ of the files, and as long as you had enough
left, you could infer the lost files. I guess this method is the bit I get
stuck on.

How are 'corrupt' points detected? By determining that no single polynomial
fits the set of points, and then finding the smallest number of points you
would need to remove so that the remainder do describe one?

~~~
infogulch
From the Go implementation linked upthread:

> The encoder _does not know which parts are invalid_ , so if data corruption
> is a likely scenario, you need to implement a hash check for each shard.

The coder just does some simple matrix math to reconstruct what you tell it to
reconstruct. You have to build the checking and detection yourself.
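
A minimal sketch of what that checking layer might look like (illustrative
Python, not tied to the Go library or any particular decoder; the final decoder
call is hypothetical):

    import hashlib

    def shard_digests(shards):
        # store one digest per shard alongside the data
        return [hashlib.sha256(s).digest() for s in shards]

    def find_erasures(shards, digests):
        # a shard may be missing (None) or silently corrupted (hash mismatch)
        return [i for i, (s, d) in enumerate(zip(shards, digests))
                if s is None or hashlib.sha256(s).digest() != d]

    shards = [b"data-0", b"data-1", b"parity-0"]   # toy stand-ins for real shards
    digests = shard_digests(shards)                # persisted next to the shards
    shards[1] = b"dXta-1"                          # simulate silent corruption
    print(find_erasures(shards, digests))          # -> [1]
    # hand that index list to the erasure decoder as the positions to rebuild,
    # e.g. decoder.reconstruct(shards, erasures=[1])  (hypothetical call)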

~~~
algo646464
This is true only for erasure RS-codes, but not for general RS-codes. In
erasure RS-codes you must specify which points are corrupted and the decoder
discards them.

For general RS-codes you don't need to detect which points are corrupted. An
intuitive explanation of the decoding process here is that, if a small number
of evaluation points are corrupted, then the original degree n polynomial is
still the closest polynomial to the (slightly corrupted) set of evaluation
points. The decoding algorithm uses this fact.

This however comes at a cost: general RS-codes can correct only half as many
errors as erasure RS-codes (with r parity symbols you can repair r known
erasures, but only r/2 errors at unknown positions).

------
pjkundert
For a pretty quick implementation w/ good C/C++ APIs, as well as BCH decoding,
check out [https://github.com/pjkundert/ezpwd-reed-
solomon](https://github.com/pjkundert/ezpwd-reed-solomon)

JavaScript and Python bindings are also available...

------
Tomte
I'm always confused about all those codes (RS, LDPC, Turbo etc.).

If you have the parchive use case, i.e. you want to "armor" files on disk so
that you can detect and correct bit flips (including bursts), which codes are
useful and which aren't?

RS obviously is (parchive, optical disks, etc.). But are fountain codes usable?
Low-density parity-check codes? I have no idea how I would select a code family.

~~~
112233
It also depends on the failure mode of your media. E.g. if you lose data in
whole 4k sectors, then it makes sense to use RS with 4k symbols instead of
8-bit ones (if there is such a thing...), because you do not need
partial-sector recovery. If individual bits can flip, then BCH is kind of
optimal. If you plan to stream, maybe some sort of FEC code, et cetera. For
example, in ye olde NAND that had 512-byte pages with a 16-byte OOB area per
page, it made sense to use BCH, because bit errors were pretty much independent
of each other. Bit errors in TLC NAND are not independent anymore.

~~~
Bulat-Ziganshin
When encoding larger blocks, they are just split into multiple small words.
E.g. when GF(256) is used, each word is a single byte. Then, with e.g. 20 data
blocks, the corresponding words of all the data blocks are encoded as a single
group, and each group generates one word for each parity block (see the sketch
below). You may consider it just as interleaving.

Overall, encoding in GF(2^n) is faster when n is smaller (since you need
smaller multiplication tables that fit better into the CPU cache). But OTOH,
encoding with K data+parity blocks requires that 2^n > K, so for best speed n
is chosen as small as possible among 8/16/32, depending on the K value.
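
Roughly like this (a sketch only; encode_words stands for any per-group word
encoder, e.g. the GF(2^8) one sketched earlier in the thread):

    def encode_blocks(data_blocks, num_parity, encode_words):
        """Take word i of every data block, encode that group, and scatter the
        resulting parity words into position i of every parity block."""
        block_len = len(data_blocks[0])               # all blocks have the same length
        parity_blocks = [bytearray(block_len) for _ in range(num_parity)]
        for i in range(block_len):
            group = [blk[i] for blk in data_blocks]   # the i-th word of each data block
            for m, w in enumerate(encode_words(group, num_parity)):
                parity_blocks[m][i] = w
        return parity_blocks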

~~~
112233
What I meant is that the smallest unit RS can correct is one word, which is
usually a byte. So 8 bit errors in one word can be corrected with less EC than
8 bit errors in different words.

In the case of BCH, the location of bit errors does not influence error
correction strength, so it is better suited to correcting independent bit
errors, while RS is better suited for bursty/whole-block errors.
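
A tiny illustration of that difference, counting how many 8-bit RS symbols a
given set of bit flips touches (each touched symbol costs one unit of
correction capacity):

    def symbols_hit(error_bit_positions, symbol_bits=8):
        # distinct symbols (bytes) containing at least one flipped bit
        return len({bit // symbol_bits for bit in error_bit_positions})

    print(symbols_hit(range(8)))          # 8 flips inside one byte  -> 1 symbol
    print(symbols_hit(range(0, 64, 8)))   # 8 flips, one per byte    -> 8 symbols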

