> During the 6 days the data would be available it just might have to be reconstructed on the fly by the error correcting rather than read directly.
Correct. More specifically, the FIRST time the data is accessed in any 24-hour period it must ALWAYS be reconstructed from the Reed-Solomon encoded parts on 17 drives in 17 different machines. Any 17 of the 20 will do, so it's totally fine if 1 or 2 drives are not available. Once reconstructed, it is stored in a set of front-end cache computers that have fast SSDs for this purpose.
The second time the same file is accessed in a 24-hour period, it will be fetched out of the SSD cache layer, so it won't even hit the spinning drives and won't care if all 20 drives are offline.
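To make the read path concrete, here is a rough sketch in Python. It is not Backblaze's actual code: the `ReadPath` class, the `decode` callback (standing in for a real Reed-Solomon decoder), and the simple 24-hour expiry on cache entries are all assumptions made for illustration. It only shows the shape of the logic: serve from the cache if the file was read recently, otherwise rebuild it from any 17 of the 20 parts.

```python
DATA_SHARDS = 17                   # parts needed to reconstruct a file
TOTAL_SHARDS = 20                  # parts per tome, one per drive
CACHE_TTL_SECONDS = 24 * 60 * 60   # assumed 24-hour cache lifetime


class ReadPath:
    """Hypothetical front-end read path with an SSD-style cache."""

    def __init__(self, decode):
        # decode: callable taking {shard_index: bytes} and returning the
        # file bytes. It stands in for a real Reed-Solomon decoder.
        self.decode = decode
        self.cache = {}  # file_id -> (cached_at, data)

    def read(self, file_id, shards, now):
        """Return (data, source). shards maps index -> bytes, or None if offline."""
        entry = self.cache.get(file_id)
        if entry is not None and now - entry[0] < CACHE_TTL_SECONDS:
            # Cache hit: the spinning drives are never touched.
            return entry[1], "cache"

        available = {i: s for i, s in shards.items() if s is not None}
        if len(available) < DATA_SHARDS:
            raise RuntimeError("fewer than 17 parts available, cannot reconstruct")

        # Any 17 parts will do; here we simply take the first 17 by index.
        chosen = dict(sorted(available.items())[:DATA_SHARDS])
        data = self.decode(chosen)
        self.cache[file_id] = (now, data)
        return data, "reconstructed"
```

With this sketch, a first read with 2 drives offline still succeeds via reconstruction, and a second read an hour later succeeds from the cache even if every drive in the tome is offline.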
> "primary traffic" (read/write stuff) is prioritized over "rebuild traffic"
Yes. Backblaze balances between the two if only one drive has failed, but as a tome (a 20-drive group spread across 20 computers) becomes more badly degraded, Backblaze begins favoring the rebuild. When two drives have failed out of 20, Backblaze stops allowing any writes to that tome, because more writes will tend to fail yet another drive. Fewer writes take load off the tome. But we still allow reads. At Backblaze, we have never been 3 drives degraded out of 20 (knock on wood), but if this ever occurs, the 20-drive tome is running without any parity, so in that case we stop allowing reads AT ALL until the tome is back to at least 1 drive's worth of redundant parity.
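The escalation described above can be summarized as a small lookup, sketched here in Python. The `TomePolicy` type, the `tome_policy` function, and the priority labels are hypothetical names chosen for illustration, not anything from Backblaze's software. Treating more than 3 failed drives the same as exactly 3 is also an assumption.

```python
from dataclasses import dataclass

TOTAL_DRIVES = 20   # drives per tome, one per computer
PARITY_DRIVES = 3   # 17 data parts + 3 parity parts


@dataclass(frozen=True)
class TomePolicy:
    allow_reads: bool
    allow_writes: bool
    rebuild_priority: str  # hypothetical labels, see tome_policy()


def tome_policy(failed_drives: int) -> TomePolicy:
    """Map the number of failed drives in a tome to what traffic is allowed."""
    if not 0 <= failed_drives <= TOTAL_DRIVES:
        raise ValueError("failed_drives must be between 0 and 20")
    if failed_drives == 0:
        # Healthy tome: nothing to rebuild.
        return TomePolicy(True, True, "none")
    if failed_drives == 1:
        # One drive down: balance primary traffic against the rebuild.
        return TomePolicy(True, True, "balanced")
    if failed_drives == 2:
        # Two drives down: stop writes, keep reads, favor the rebuild.
        return TomePolicy(True, False, "favored")
    # Three or more down: no parity left, so stop reads as well
    # (assumption: anything beyond 3 is handled like 3).
    return TomePolicy(False, False, "exclusive")
```

For example, `tome_policy(2)` allows reads but not writes, while `tome_policy(3)` allows neither until the rebuild restores some parity.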