Reed-Solomon error recovery in RAID-6

eps · on Jan 7, 2021

There's also this paper, which is quite good -

https://mirrors.edge.kernel.org/pub/linux/kernel/people/hpa/...

Also there are simpler, xor-based encoding schemes that allow recovering from double failures (e.g. EvenOdd and such), but the problem is that they are all patented.

est31 · on Jan 7, 2021

This E-Mail about increased numbers of parities is also interesting: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg...

huhtenberg · on Jan 7, 2021

That's the SnapRAID dev!

Also, check out this bit -

  >> Could you perhaps describe how the Cauchy matrix is 
  >> derived, and under what conditions it would become
  >> singular?

  > The Cauchy matrix has the mathematical property ...
  > ...
  > Besides the mathematical proof, I've also inverted all 
  > the 377,342,351,231 possible submatrices for up to 6 
  > parities and 251 data disks, and got an experimental 
  > confirmation of this.

"Trust, but verify" :)

[0] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg...

miahi · on Jan 7, 2021

The talk about striping (the fact that all the disks contain both data and the two levels of parity) is glossed over, but what it means is that there are not 7 types of failure/recovery, there are just two - when one drive fails, and when two drives fail. And the recovery actually contains all cases all the time (case 1-3 for 1 disk, case 4-7 for two disks).

dragontamer · on Jan 7, 2021

On the one hand, this is a great "brute force" method to learn how one implementation of RAID6 could happen. This blogpost achieves understanding of error-correction in the 2-parity case through brute force alone, and that's actually quite commendable.

On the other hand, its a very, very bad way to learn Reed Solomon codes. Reed Solomon codes "in general" are a concept that's broader than just 2-parity symbols... and also broader than just "protection vs erasures". Reed Solomon is designed to correct errors as well (a far "harder" computation than simply an erasure).

I've fortunately found a good "ELI5" version of Reed Solomon: https://www.youtube.com/watch?v=xE4jEKx9fTM

That's a very non-technical talk, but that talk gives the full concepts of "Polynomial in Finite Field" that Reed Solomon codes are based off of.

For a proper computer-based Reed Solomon code, you choose the GF(2^8) extension field for 8-bit symbols (Pararchive uses 2^16 for 16-bit symbols). Extension fields are extremely complicated to understand however, this will take a few days of Abstract Algebra study to understand how extension fields work.

But once you know how a finite extension field works, you then extend a polynomial into the extension field, and bam, you have Reed Solomon codes.

That is: a Quadratic equation can protect 3-data. All quadratic equations are ax^2 + bx + c, so any 3-points define a quadratic equation. Generate (x,y) coordinates on a graph as (0, data0), (1, data1), (2, data2).

Then, solve for a quadratic equation f(x) = ax^2 + bx + c.

Then, to generate parity, f(3) is your first parity symbol, f(4) is your next parity symbol, f(5) is the parity symbol after that. Etc. etc. Note in a 8-bit finite field, you can only go up to f(255) before "looping back" to f(0).

From there: 2-check symbols will protect against either 2-erasures or 1-error. So Quadratic Equation with 4-check symbols (aka: 4 "other points" of the quadratic equation passed with the data) can protect against 4 erasures... or 2-errors, or 2-erasures + 1 error.

klodolph · on Jan 7, 2021

Note that the size of the field only has to be as large as the size of the redundancy set (data + code). So if you are striping across 16 or fewer disks, you can use GF(2^4), but if you are striping across 17 disks you can’t use GF(2^4).

Well, if I remember the math correctly.

dragontamer · on Jan 7, 2021

> Well, if I remember the math correctly.

You're off by one. There's a "minus 1" because Reed Solomon is over the cyclical multiplication group, not over the cyclical additive group. (TL;DR: The number 0 cannot participate in... some portion of the formulation of the polynomials)

GF(2^3) is over blocksize of 7. GF(2^4) is over blocksize of 15. GF(2^8) is over 255.

---------

GF(2^8) is the smallest size in practice, because most people's data is organized into 8-bits-at-a-time today. (aka: a byte). But you're mostly right (once you account for the off-by-one error): there's a limit to 255 elements per block in GF(2^8) Reed Solomon.

But GF(2^3) or GF(2^4) are great for hand-exercises, the smaller size makes it far easier to compute calculations by hand and allow for the human to gain understanding. (aka: Octal 2^3 numbers or Hexadecimal 2^4 numbers). GF(2^2) is technically possible... but all the matrix math is over 1x1 matricies and not really a good opportunity to learn from. GF(2^1) is literally impossible for Reed Solomon to work in. So GF(2^3) is the smallest practical hand-exercise size (RS(7, 5) over GF(2^3) has a 5x7 generator matrix and a 7x2 check matrix. Small enough to do by hand though a bit tedious).

If you actually wanted to "split symbols up" so that you better detect bit-errors, then LDPC or Turbocodes are straight up superior over Reed Solomon for "small symbol sets". Reed Solomon is best used over "large symbols" (aka: 2^8 or larger).

----------

Any Reed Solomon code can be punctured (aka: fewer parity symbols), or shortened (aka: fewer data symbols) by simply setting locations of the Reed-Solomon generator and/or check matrix to 0.

As such, if you organize "smaller sets" into 2^8 (aka: 8-bit bytes), its just very natural to use a punctured + shortened Reed Solomon code instead of trying to get a 2^4 Reed Solomon code to work.

klodolph · on Jan 7, 2021

You’re right! Off by one.

> GF(2^8) is the smallest size in practice, because most people's data is organized into 8-bits-at-a-time today.

IIRC GF(2^4) is actually used in practice. Long story short, the CPU overhead of rebuilding data is a real concern.

dragontamer · on Jan 7, 2021

> IIRC GF(2^4) is actually used in practice. Long story short, the CPU overhead of rebuilding data is a real concern.

Interesting. I don't know about how these systems are implemented in practice, but I'd be surprised if GF(2^8) was "too big".

GF(2^8) was small enough that it was implemented in 1977 on the Voyager probe project. Yeah, it took a literal supercomputer to decode that RS-code in the 1980s, but modern computers are basically as fast as 1980s-era supercomputers.

Modern papers are talking about GF(2^32) to GF(2^128) these days. (https://www.ccis.northeastern.edu/home/alina/papers/tos-gf.p...). Most implementations don't need anything near that size, so GF(2^8) is sufficient.

---------

I mean, I've done GF(2^4) RS codes by hand. That's a very, very, very small system, human calculable with just pen and paper. So you're right that its far easier to calculate. I'm just kind of surprised that anyone would bother with a computer RS code of that miniscule size.

Ex: RS(15, 9) is a 9x15 generator matrix and a 6x15 check matrix over GF(2^4). Its on the large-side of hand-doable calculations, but absolutely doable by pen-and-paper (I recommend graph paper). Every symbol is a 2^4 hexadecimal symbol, so a single hand-written character (0123456789ABCDEF) is stored on a single cell in graph paper.

-------

Are you sure those systems are working in GF(16) fields? Not GF(2^16) AKA gf(65536) fields? GF(2^16) is more of the size I'd expect a computer to do, more so than GF(2^4).

klodolph · on Jan 7, 2021

Yes, I am sure that it’s GF(2^4) and not GF(2^16). The problem is not that GF(2^8) or GF(2^16) are “too big”, it’s just that GF(2^4) is faster and uses less CPU.

The purpose of using GF(2^128) or other large sizes is usually cryptographic, which has different tradeoffs. For example, Galois counter mode. The linked paper describes secure storage mechanisms.

Also note that you’re not just doing one field operation per byte, but several. If I’m remembering correctly, let’s say you’re using an (11,8) code, then you’re doing 24 multiply and 21 addition operations to calculate the encoded message. You might think of various ways to accelerate this which benefit from a smaller field size, especially if you are doing the calculations on commodity CPUs. The only point I’m making here is that the computation consumes some nontrivial amount of CPU resources, and so it can be fruitful to optimize.

The fact that space probes like Voyager use a large field size is just because the cost of resources is different. For space probes, bandwidth is a precious resource. If you spent $250M on a single space probe, you can afford to use a bit more CPU to encode and decode your puny 100 kHz downlink. On the other hand, if you are running a data center, the power consumption of CPUs is one of your largest costs.

dragontamer · on Jan 7, 2021

> Yes, I am sure that it’s GF(2^4) and not GF(2^16). The problem is not that GF(2^8) or GF(2^16) are “too big”, it’s just that GF(2^4) is faster and uses less CPU.

A multiply in GF(2^8) is just a single multiplication lookup. 256 x 256 == 64kB lookup table. I'm not sure how much faster you can get than a single 8-bit lookup on a table small enough to fit in a typical CPU's cache.

GF(2^8) is so small that you don't even need to do the log-table / antilog-table trick. Just store the entire multiplication table, since the multiplication table fits in L2 cache.

A GF(2^4) multiply would be a single multiplication lookup 16 x 16 or 256 byte (0.25kB) lookup table. Very, very tiny, probably no faster than the GF(2^8) lookup.

> Also note that you’re not just doing one field operation per byte, but several. If I’m remembering correctly, let’s say you’re using an (11,8) code, then you’re doing 24 multiply and 21 addition operations to calculate the encoded message.

That RS(11,8) code (probably shortened / punctured) could be a GF(2^8) code. The question is: do you want to organize your data in 8-bit bytes in GF(2^8), or 4-bit nibbles in GF(2^4)?

I expect that 8-bit bytes is actually faster for modern computers than 4-bit nibbles. Even if you are using a severely shortened / punctured code, there are strong benefits to sticking with 8-bit bytes on a modern computer.

That's the thing: computers usually have 8-bit bytes as the smallest pragmatic unit to use. GF(2^8) lines up with that perfectly, and is used in a variety of high-performance applications (including AES).

> The fact that space probes like Voyager use a large field size is just because the cost of resources is different. For space probes, bandwidth is a precious resource. If you spent $250M on a single space probe, you can afford to use a bit more CPU to encode and decode your 100 kHz downlink.

The Voyager probe launched in 1977, it was with an utter crap CPU by today's standards. Your USB power-cable has more CPU power and RAM than the voyager probe... that tiny USB CPU is used to negotiate the voltage level (5V vs 12V USB power negotiation)

--------

Voyager launched with 32k 18-bit words of RAM and 4MHz for its 6x parallel CPUs (handling different parts of the craft: flight vs imaging unit vs radio unit... etc. etc.). I think you're severely overestimating the strength of Voyager probe, or the level of technology available to 1977.

klodolph · on Jan 7, 2021

> A multiply in GF(2^8) is just a single multiplication lookup. 256 x 256 == 64kB lookup table. I'm not sure how much faster you can get than a single 8-bit lookup on a table small enough to fit in a typical CPU's cache.

64 KB often won’t fit in L1 cache anyway.

> A GF(2^4) multiply would be a single multiplication lookup 16 x 16 or 256 byte (0.25kB) lookup table. Very, very tiny, probably no faster than the GF(2^8) lookup.

Or you can do three operations using a single 16 x 16 x 16 x 16 lookup table, and introduce fewer data dependencies into your CPU pipeline.

The operations work on half as many bits, but complete three times as many steps. Sounds like a win to me.

> I expect that 8-bit bytes is actually faster for modern computers than 4-bit nibbles. Even if you are using a severely shortened / punctured code, there are strong benefits to sticking with 8-bit bytes on a modern computer.

You keep saying this. I understand how CPUs are byte-addressed. When you operate on smaller sizes, you have to do packing / unpacking. If your CPU core spends most of its time doing table lookups, then a little bit of packing / unpacking is nearly free, since your ALUs are otherwise fairly idle, assuming you have a couple registers free.

dragontamer · on Jan 7, 2021

> Or you can do three operations using a single 16 x 16 x 16 x 16 lookup table, and introduce fewer data dependencies into your CPU pipeline.

Gotcha.

So to summarize: you've seen GF(2^4) calculations performed SIMD style. A singular 16-bit lookup can be 4x GF(2^4) lookups in parallel.

Seems legit to me. A non-obvious implementation, but clearly possible and faster.

alanfranz · on Jan 7, 2021

Is there any system for error recovery in userspace? Something like, I’ve got directory DATA, i want to create a recovery file for all the contents in such directory. Ideally I’d like to be able to tune the size of the recovery, larger recovery = recover more errors

dmayle · on Jan 7, 2021

What you're looking for is called Parchive. It uses Reed-Solomon coding in order to generate redundant files that can be used to recover the original. Look for PAR2 utilities (the current major version of the format). It has multiple implementations, open source and closed, across platforms and architectures.

https://en.wikipedia.org/wiki/Parchive

thomasmg · on Jan 7, 2021

Another option is to use an IBLT (Inverted Bloom Lookup Table). it is easy to implement. I recently wrote a file repair tool that uses it: https://github.com/thomasmueller/tinyStats/blob/master/src/m... (not production ready)

brnt · on Jan 7, 2021

I wrote a gui for it, tot help deal with many files being in various states: https://github.com/brenthuisman/par2deep

auxym · on Jan 7, 2021

We used to use this when creating backups on CD-Rs and later DVDs, before hard drives became cheap...

https://en.wikipedia.org/wiki/Parchive

bestham · on Jan 7, 2021

I used a program named Dvdisaster for the same use case. You could use it I two modes: (1) One disc with additional out of band reed-solomon ECC. Create an iso that are smaller by a certain percentage than the medium you want to burn it to. Dvdisaster will create a new ISO that uses the full capacity of the medium and fill the remaining space with parity-information that protects the whole file system and all user files. (2) Put your parity information on the next disc you burn. A parity file is crated for the first disc you burn, this is later added to the file system of the second disc you burn. Repeat indefinitely. You always have the last yet-to-be-burned parity file on your computer and can thus recover bit-errors in every disc you have burned.

dylan604 · on Jan 8, 2021

We used to use this in Newsgroup days. Large files got broken into smaller chunks (maybe floppy disk sized??) in the RAR format. You could recover missing files if they expired before you get them downloaded on your dial-up modem or your particular newsgroup service wasn't reliable enough.

sliken · on Jan 7, 2021

This isn't a turn key utility, but it's relatively easy to call from Go. It allows you to set the number of data shards and parity shards, it's pretty fast. It's at

  https://github.com/klauspost/reedsolomon

Seems like a pretty common solution, I suspect there's other libraries for popular language/platforms.

redis_mlc · on Jan 7, 2021

I'm a storage engineer.

There's a huge difference bewtween theory and practise.

- rebuilds can timeout

- load can really cause timeouts

- raid controllers can have tested modes that work, and some that don't

- batches of drives can all fail at the same time, esp. during rebuilds

- the nvram might not be portable across controllers.

mchristen · on Jan 7, 2021

More than once I've lost a RAID array during a rebuild. That's what backups are for, I suppose.

klodolph · on Jan 7, 2021

This is to be expected. RAID 5 is not safe. It is common for a RAID 5 array to be running in a degraded state during “normal” operation without the operator’s knowledge. Then, when one drive fails, rebuilding the array is impossible; you have already lost data.

This is common because there is often no good way to repair a RAID 5 system which is running in degraded state, and there is often no monitoring to respond to array degradation (if you can’t fix it, why monitor it?)

In other words, RAID 5 does not protect very well against drive failures. RAID 6 is more durable but inefficient (there are systems which are both more durable and more efficient than RAID 6). RAID is generally optimized for implementation simplicity over other concerns.

parliament32 · on Jan 7, 2021

I'm having trouble with some of what you're saying. Why is RAID 5 any different than, say, RAID 1? You monitor them both the same way (your raid controller tells you it's degraded), and you fix both the same way (slam in a new drive and rebuild). Why would monitoring/recovery be any different than any other RAID level on a controller?

dylan604 · on Jan 8, 2021

RAID 1 is just direct copies of the data onto 2 separate drives. No parity is involved. RAID 5 breaks the data into N-1 segments and then has parity data on the remaing disc. The drive the parity data gets saved to rotates through the drives in the array so that no single drive has all of the parity data. That's what RAID 3 does.

So, if you have a RAID 5 array where 1 drive dies, the rebuild process is meant to be able to recreate the missing data on the new drive using the parity data from the rest of the drives in the array. However, if one of those drives has read errors, typically CRC type errors, then the rebuild cannot continue because the data it needs is not available. Bye bye data!!

Also, RAID 5 can achieve higher throughput since the data is split between the number of drives in the array. We used to build RAID 51 volumes which is 2 separate RAID 5 arrays, but then add the 2 volumes as RAID 1 volume so they are mirrored. The system only sees 1 volume with the capacity of a single RAID 5 volume. That was before RAID 6.

tutfbhuf · on Jan 7, 2021

> there are systems which are both more durable and more efficient than RAID 6

RAID 10?

speakeron · on Jan 7, 2021

RAID 10 is more efficient (if that equates to performant) than RAID 6, but not more durable. In fact it's less.

As an example, a 4-disk array (which has the same capacity in both RAID 6 and RAID 10) has 4 2-disk failure modes. RAID 6 can handle all of these failure modes without losing data, whereas RAID 10 can only handle 2 of them.

klodolph · on Jan 7, 2021

Efficiency is how much overhead you have. RAID 10 has 50% efficiency (50% of your storage is used for data), RAID 6 has N/N+2 efficiency. Since RAID 10 means using 4 drives, RAID 10 is never more efficient than RAID 6 for the same number of drives.

The three main things we can optimize for are storage efficiency, I/O performance, and data durability. For most situations, RAID 6 is not Pareto-optimal. In other words, you can come up with solutions that have better storage efficiency, better I/O performance, or better data durability, without sacrificing anything.

The only reason you’d use RAID 6 is because it’s easy to use. It’s measurably worse on any other axis. Combine this with various low-quality RAID implementations and RAID is even worse. Software RAID in the Linux kernel is fairly robust but there are many low-quality hardware RAID implementations.

jandrese · on Jan 7, 2021

Efficiency is not just drive efficiency but controller time. The 4 disk RAID1_0 array may have the same drive space overhead as the RAID6 version, but it may be much slower due to the controller overhead.

One of the often overlooked aspects of RAID is that the more complexity there is in the controller the higher the risk of losing your data due to hardware/firmware/software error. The worst part is these errors often don't appear until your array is already compromised in some way, so everything is fine until a single disk failure triggers a previously unknown bug that causes the rebuild procedure to clobber all of the data on all of the drives.

I've known some professionals who use RAID1_0 whenever they can get away with it because they don't trust RAID controllers to handle anything more complex.

klodolph · on Jan 7, 2021

> Efficiency is not just drive efficiency but controller time.

Sorry, I should be clear. When I say efficiency, I mean “storage efficiency” only. Controller time I count as a part of performance. Perhaps my comment makes more sense with that definition in mind.

Dylan16807 · on Jan 7, 2021

RAID 6 is pretty darn simple, I'm not sure what controller issues you'd have.

That said, RAID 10 is great for many professional use cases. It's very fast, scores well on reliability, the cost premium isn't too bad, and you need backups anyway.

sliken · on Jan 7, 2021

1st lesson on using a RAID should be that RAID is not a replacement for backups.

layoutIfNeeded · on Jan 7, 2021

Edit: I was wrong

bfuclusion · on Jan 7, 2021

Correct me if I'm wrong, but I think that scheme will protect you from at most one erasure.

Fail D1 and P1 There's no way to recover D1s data since P2 has no data from D1 in it.

jakubp · on Jan 7, 2021

There is no mention of "P1" in the entire article. What did you mean?

layoutIfNeeded · on Jan 7, 2021

Actually, you're right!