The basic concept is that if I give you five points all on the same line, you only need any two of them to reconstruct the line.
Reed–Solomon doesn't use lines (they aren't optimal) and the geometry it uses isn't the Cartesian plane (that would require infinite precision) - but this is essentially what the codes do!
Just like it requires two points to reconstruct a line, three points are required to reconstruct a simple polynomial (degree 2). Reed Solomon uses "high degree polynomials" which require n+1 points to reconstruct the degree-n polynomial.
The codes take data, make a polynomial out of it, and then send points from the polynomial to the other side, which can interpolate it as soon as enough points are gathered!
All of this looks like impenetrable discretization and matrix operations if you just get a description of the algorithms, which makes it seem a lot less approachable than it is.
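The points idea really does fit in a few lines. Here's a toy sketch with Lagrange interpolation over a prime field; the prime, the message, and which shares get dropped are all arbitrary choices for illustration:

```python
P = 7919  # any prime works; arithmetic mod P forms a finite field

def eval_poly(coeffs, x):
    """Evaluate a polynomial (coefficients, lowest degree first) mod P."""
    r = 0
    for c in reversed(coeffs):  # Horner's rule
        r = (r * x + c) % P
    return r

def interp_at(points, x):
    """Lagrange interpolation: the value at x of the unique polynomial
    of degree len(points)-1 through the given (xi, yi) points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num = den = 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

msg = [5, 17, 42]                                       # p(x) = 5 + 17x + 42x^2
shares = [(x, eval_poly(msg, x)) for x in range(1, 6)]  # five points on p
any_three = [shares[1], shares[3], shares[4]]           # any 3 of the 5 suffice
assert interp_at(any_three, 0) == 5                     # constant term recovered
```

Sending five points of a degree-2 polynomial means any two points can be lost in transit and the message still comes back.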
This was in fact the first decoding algorithm for Reed–Solomon codes, but it's not very efficient when you don't know which of your points are the ones with the corrupted data; you kind of have to guess, which gets expensive fast when you're trying to tolerate more than one or two errors.
Although it turns out that there are better algorithms for that, the most commonly used RS codes don't actually encode a series of points on a polynomial, as subjoriented's comment at the root of this thread suggests; instead they are so-called "BCH codes", where the data you transmit (both the original data and the "parity" symbols) is treated as a sequence of coefficients. So where does the redundancy come from? After all, any set of numbers is valid as the coefficients of a polynomial.
BCH codes require the polynomial to be divisible by a known generator polynomial, and that's where you get the redundancy you need to correct errors. Gorenstein and Zierler published an efficient error-correction ("decoding") algorithm for BCH codes in 1960; the first efficient decoding algorithm for original-view RS codes (where you transmit a series of points) wasn't found until 1986, and even today, BCH-code decoding algorithms are more efficient than original-view decoding algorithms.
The Gorenstein–Zierler algorithm works by evaluating the received polynomial at the roots of the generator polynomial. Since it's supposed to be a multiple of the generator, it should be zero at those roots, as the generator is; if it's nonzero, the values at those roots are due to some "error polynomial" that's been conceptually added to your message in transit. If you suppose that it has only a few nonzero coefficients, there's a clever way to compute the error polynomial from those values that should have been zero. This allows you to subtract it from the received message to correct the errors.
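That syndrome check is easy to see in a toy sketch over a small prime field (the prime, the roots, and the message here are all arbitrary illustration choices; real implementations use binary fields like GF(2^8)):

```python
P = 101  # toy prime field; real RS/BCH codes use GF(2^8) and the like

def poly_mul(a, b):
    """Multiply two polynomials (coefficient lists, lowest degree first) mod P."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] = (out[i + j] + x * y) % P
    return out

def eval_poly(coeffs, x):
    r = 0
    for c in reversed(coeffs):  # Horner's rule
        r = (r * x + c) % P
    return r

a = 3                                       # arbitrary nonzero element
g = poly_mul([-a % P, 1], [-a * a % P, 1])  # generator g(x) = (x-3)(x-9)
msg = [7, 2, 55]                            # arbitrary message coefficients
codeword = poly_mul(msg, g)                 # a multiple of g by construction
syndromes = [eval_poly(codeword, a), eval_poly(codeword, a * a)]
assert syndromes == [0, 0]                  # clean codeword: zero at g's roots

received = list(codeword)
received[2] = (received[2] + 1) % P         # corrupt one coefficient in transit
bad = [eval_poly(received, a), eval_poly(received, a * a)]
assert bad == [9, 81]                       # exactly the error polynomial x^2
                                            # evaluated at the roots 3 and 9
```

The nonzero values are the error polynomial's values alone, since the clean codeword contributes nothing at those roots - which is why they're enough to locate and correct the error.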
At least, I think that's how it works. I haven't implemented an RS decoder yet, so I might be getting some of this wrong.
But something like that was what I was actually hoping to read at the above link.
It runs in O(n^3) time, and unfortunately is not the state of the art. What actual Reed-Solomon implementations use are the Berlekamp-Massey and Forney algorithms, which run in O(n^2) time.
Also, the BCH encoder multiplies the input sequence by the generator polynomial, which is a convolution of their coefficient sequences. Fourier-type transforms (i.e. the number-theoretic transform) relate convolution and pointwise multiplication, so I feel there's an underlying connection here, but I don't have enough experience with finite fields to connect the dots...
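The convolution half of that connection is easy to check concretely. A tiny sketch (plain integers, no finite field) showing that convolving coefficient sequences agrees with pointwise multiplication of the polynomials' values:

```python
def conv(a, b):
    """Convolution of coefficient sequences == polynomial multiplication."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def evaluate(coeffs, x):
    r = 0
    for c in reversed(coeffs):  # Horner's rule
        r = r * x + c
    return r

a = [1, 2]     # 1 + 2x
b = [3, 0, 1]  # 3 + x^2
c = conv(a, b) # (1 + 2x)(3 + x^2) = 3 + 6x + x^2 + 2x^3

# evaluating then multiplying agrees with evaluating the product everywhere:
assert all(evaluate(a, t) * evaluate(b, t) == evaluate(c, t) for t in range(5))
```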
What dots do you want to connect? (Anyway you can connect them with a high-degree polynomial curve. :-) )
They mention that both procedures are basically a multiplication by a Vandermonde matrix, with the Fourier transform having the additional requirement of using the n-th roots of unity.
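Concretely, evaluating a polynomial at a set of points is a Vandermonde matrix-vector product, and picking the points to be powers of a root of unity turns the same matrix into a DFT/NTT matrix. A minimal sketch (all numbers arbitrary):

```python
def vandermonde(points, n):
    """Row i is [1, x_i, x_i^2, ..., x_i^(n-1)]."""
    return [[x ** j for j in range(n)] for x in points]

def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]

coeffs = [3, 1, 4]  # p(x) = 3 + x + 4x^2
points = [0, 1, 2, 3]
evals = matvec(vandermonde(points, len(coeffs)), coeffs)
# evals == [p(0), p(1), p(2), p(3)]

# With points = powers of a root of unity, the same construction is a DFT:
# 2 is a 4th root of unity mod 5 (2**4 % 5 == 1), so this is a size-4 NTT matrix
ntt = [[pow(2, i * j, 5) for j in range(4)] for i in range(4)]
```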
Please do this for the Berlekamp-Massey algorithm! Is it the EM algorithm for this problem, or am I smoking banana peels?
I have a background in error-correcting codes - Hamming, Luby, Raptor, and so on. Reed-Solomon is particularly "beautiful" from a mathematics perspective (more so than Raptor) - something I can usually explain to anyone and get them interested.
It looks to be built on matrix algebra.
Kinda nifty how simple and elegant the underlying concept is, really. Reminds me of the surprising conceptual simplicity behind Diffie-Hellman.
It should be (says the mathematician)! That it isn't probably comes from confusing the elegant theoretical perspective on finite fields with the computational perspective, which (at least from my pure-math point of view) can get pretty hairy.
I'm down the Wikipedia rabbit hole too, now.
That is very efficient. So in exchange for storing about twice your data, you can detect all errors where ~30% of the bits are flipped. You can even _correct_ all errors where only 12% of the bits are flipped!
I still consider it magic that such an efficient and 'perfect' code exists. There is unfortunately a proof that it is the best one, and that there is no bigger perfect one.
If you’re interested in Reed-Solomon codes (and other error correcting codes), I recommend reading Huffman’s (NB: not the same Huffman as in Huffman codes!) textbook on the subject:
Fundamentals of Error Correcting Codes
Glad to see other people getting excited by them too!
Even back with old analog TV broadcasts, if there was a signal issue, the picture would just get fuzzy. Now with HDTV, the picture goes away completely.
I know a lot of this has to do with analog formats being highly inefficient and thus having lots of redundancy.
In fact, Europe (or rather, large parts of it) seems to be on the verge of shutting down terrestrial TV entirely.
You take any message and can generate any number of coding symbols from the message, and a receiver that gathers _any_ K of those coding symbols can decode the message.
The mental image I have is using these to send messages into space -- a receiver could pick up at any point and just wait until they have any K coding symbols and they'll be able to decode the message.
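One of the simplest rateless constructions is a random linear fountain over GF(2): each coded symbol is the XOR of a random subset of the message blocks, tagged with which subset was used, and the receiver runs Gaussian elimination until it has K independent symbols. A rough sketch - all names and parameters invented for illustration, and real fountain codes (LT/Raptor) use cleverer degree distributions to make decoding cheap:

```python
import random

def encode_stream(blocks, seed=0):
    """Endless stream of (mask, value) symbols: each is the XOR of a
    random nonempty subset of the message blocks, tagged by a bitmask."""
    rng = random.Random(seed)
    k = len(blocks)
    while True:
        mask = rng.randrange(1, 1 << k)
        value = 0
        for i in range(k):
            if mask >> i & 1:
                value ^= blocks[i]
        yield mask, value

def decode(symbols, k):
    """Gaussian elimination over GF(2), kept in reduced row-echelon form;
    consumes symbols until k independent ones have arrived."""
    rows = {}  # pivot bit -> (mask, value)
    for mask, value in symbols:
        for bit, (m, v) in rows.items():        # reduce by existing pivots
            if mask >> bit & 1:
                mask, value = mask ^ m, value ^ v
        if mask == 0:
            continue                            # redundant symbol; wait for more
        pivot = mask.bit_length() - 1
        for bit, (m, v) in list(rows.items()):  # clear the new pivot elsewhere
            if m >> pivot & 1:
                rows[bit] = (m ^ mask, v ^ value)
        rows[pivot] = (mask, value)
        if len(rows) == k:                      # full rank: each row is one block
            return [rows[b][1] for b in range(k)]

message = [10, 20, 30, 40]
stream = encode_stream(message, seed=1)
for _ in range(3):                              # lose the first three symbols
    next(stream)
assert decode(stream, 4) == message             # enough later symbols still decode
```

The "any K symbols" property here is only probabilistic (a random set of K masks is very likely, not guaranteed, to be independent), which is the usual trade-off rateless codes make.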
My understanding of all this stuff comes from USENET and par/par2/yenc (where, IIRC, the message gets packaged up in plaintext blocks and if your local server is missing N blocks you have to collect N parity blocks and you can recover from there, but my recollection is probably fuzzy), and I imagine rateless codes would be quite useful for USENET binaries.
People don't realise how ubiquitous different forms of encoding schemes are (I am leaving out compression/decompression schemes altogether here, but it has always amazed me what a profound impact Information Theory has, and yet how out of the way it is):
Forward error-correction codes (FEC) at L2: IBM's elegant 8b/10b encoding https://en.wikipedia.org/wiki/8b/10b_encoding used in pre-10G era Ethernet, and its successor 64b/66b https://en.wikipedia.org/wiki/64b/66b_encoding used in post-10G era Ethernet / Infiniband / RoCE, and the sister 128b/132b encoding used by the PCIe 3 and USB 3 standards.
Linear error-correcting codes (LEC) for Volatile Memory: https://en.wikipedia.org/wiki/Hamming_code
FEC for HD Voice Transmission over Cellular Networks: AMR used in 3G and VoIP https://en.wikipedia.org/wiki/Adaptive_Multi-Rate_audio_code... and AMR WB for VoLTE aka HD Voice: https://en.wikipedia.org/wiki/Adaptive_Multi-Rate_Wideband
Fountain Codes for Digital TV on Mobile (DVB-H): Qualcomm's Raptors https://en.wikipedia.org/wiki/Raptor_code
Turbo Codes for 3G/4G Connectivity (OFDM), Satellite communication: https://en.wikipedia.org/wiki/Turbo_code
Optimal Erasure Coding (for data at rest, and for Mobile TV, and for 3G/4G): Reed-Solomon and its variants are a clear winner here (used by Dropbox https://blogs.dropbox.com/tech/2016/07/pocket-watch/ and rumored to have been used by S3 https://perspectives.mvdirona.com/2009/06/erasure-coding-and... )
Upcoming: https://en.wikipedia.org/wiki/Low-density_parity-check_code LDPC codes are recommended for the data channel in 5G, replacing Reed-Solomon, whilst polar codes are adopted for the control channel in 5G https://en.wikipedia.org/wiki/Polar_code_(coding_theory)
Such a brilliant system.
zfec (gpl2+): https://github.com/tahoe-lafs/zfec (discussion: https://news.ycombinator.com/item?id=12976168)
liberasurecode (bsdesque): https://github.com/openstack/liberasurecode
cm256 (3 clause bsd): https://github.com/catid/cm256
openrq (apache2.0): http://openrq-team.github.io/openrq/
Regardless, thanks for these!!
In terms of turbo codes, I think the CCSDS turbo code is out of patent, at least 5,446,747 has been expired for over 5 years.
It's very enlightening to see finite-field arithmetic implemented in a few lines of Python --- it gives something concrete to look at and think about.
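For instance, multiplication in GF(2^8) - the field byte-oriented RS implementations usually work in - really is only a few lines. This sketch uses 0x11B, the AES reduction polynomial; RS implementations often use 0x11D instead, but any irreducible degree-8 polynomial gives a field:

```python
def gf_mul(a, b, poly=0x11B):
    """Carry-less ("Russian peasant") multiply of two bytes in GF(2^8),
    reducing by an irreducible degree-8 polynomial as we go."""
    r = 0
    while b:
        if b & 1:
            r ^= a          # "add" (XOR) the current shift of a
        a <<= 1
        if a & 0x100:       # overflowed 8 bits: reduce by the field polynomial
            a ^= poly
        b >>= 1
    return r

# the classic AES identity: x * x^7 wraps around to x^4 + x^3 + x + 1
assert gf_mul(0x02, 0x80) == 0x1B
```

Addition in this field is just XOR, which is why these codes are such a good fit for hardware.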
That Backblaze speed quoted seems really slow.