
Zfec – Efficient, portable erasure coding tool - Tomte
https://github.com/tahoe-lafs/zfec
======
sliken
Anyone know of a comparison of similar tools? I've tinkered with this one:
[https://github.com/klauspost/reedsolomon/](https://github.com/klauspost/reedsolomon/)

I believe it's compatible with the Java implementation used by Backblaze and
others.

~~~
vu3rdd
One of the authors of Tahoe-LAFS and the original creator of zfec, Zooko, co-
authored a paper on the performance of such codes.

[http://nisl.wayne.edu/Papers/Tech/code-pf-fast09.pdf](http://nisl.wayne.edu/Papers/Tech/code-pf-fast09.pdf)

Luigi Rizzo is the original author of the fec C library.

I co-maintain zfec (and the Debian package of it) along with Zooko. A new
release of zfec is planned to be out soonish.

edit: removing the <> delimiters around the URL. Sorry!

~~~
zokier
HN sadly does not support <> URL delimiters, so your link ends up 404. Here it
is without delimiters:

[http://nisl.wayne.edu/Papers/Tech/code-pf-fast09.pdf](http://nisl.wayne.edu/Papers/Tech/code-pf-fast09.pdf)

~~~
vu3rdd
Thanks. Updated the link.

------
olavgg
I find liberasurecode more interesting
[https://github.com/openstack/liberasurecode](https://github.com/openstack/liberasurecode)

BSD licensed

Pluggable erasure coding backends, and supports Intel's ISA-L

There is also a Python library available
[https://bitbucket.org/kmgreen2/pyeclib](https://bitbucket.org/kmgreen2/pyeclib)

~~~
notmyname
PyECLib upstream is now at
[https://github.com/openstack/pyeclib](https://github.com/openstack/pyeclib)

------
0x0
Another tool that is the de-facto standard on usenet for erasure coding is
.par2
[https://en.wikipedia.org/wiki/Parchive](https://en.wikipedia.org/wiki/Parchive)

~~~
gwern
I'm not sure PAR2 really counts as erasure coding rather than a more general
FEC. I've tried zfec in the past but it's not really set up for backup uses
such as creating an archive to burn to a set of BD disks.

For backups, what you want is to generate additional files, par2 style, which can
be combined with the regular archive files (eg like in duplicity, GPG-
encrypted tarballs) to repair any bitrot or lost fractions of those archives.
But with zfec, as I understood it, it only works if you turn them entirely
into erasure-coded shares and assume you'll lose _entire_ shares and wind up
using m-of-n shares in recovery/restoration (as opposed to PAR2 which will
handle losing individual archive files but also arbitrary corruption in any of
those archive files, as long as the total does not exceed the redundancy %).
So you would have to do something like create a single giant archive file, and
if you wanted the equivalent of 50% redundancy, you'd use zfec to turn it into
'50-of-100' shares and backup the 100 shares (or do 500-of-1000 etc).

Then any shares which aren't bit-identical get tossed out at restoration
and you hope you have enough shares left to reconstruct the original giant
archive file. But you might not - your shares might be _mostly_ good but
missing a few chunks, which is something a PAR2 setup could cope with (if the
chunks are in the archive files, they get repaired normally, and if they're in
the PAR2 redundancy, they don't matter). This is fine in the Tahoe-LAFS or
datacenter setup where you assume your storage nodes will store perfectly or
fail entirely (since given any live errors, you can just scrub the entire
machine and rebuild an additional copy from the erasure-coded shares). Not so
much in a backup setting.
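
A minimal sketch of the restoration check described above, to make the failure mode concrete. It assumes you recorded a SHA-256 digest for every share at backup time; the `manifest` dictionary and the function names are hypothetical and are not part of zfec. A single flipped bit is enough to disqualify a whole share, and restoration only works if at least k intact shares remain.

```python
import hashlib
from typing import Dict, List


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a share's contents."""
    return hashlib.sha256(data).hexdigest()


def usable_shares(shares: Dict[str, bytes], manifest: Dict[str, str]) -> List[str]:
    """Names of shares whose digest still matches the one recorded at backup time."""
    return [name for name, blob in shares.items()
            if manifest.get(name) == sha256_hex(blob)]


def can_restore(shares: Dict[str, bytes], manifest: Dict[str, str], k: int) -> bool:
    """True if at least k shares survived bit-identical (assumes a k-of-n code)."""
    return len(usable_shares(shares, manifest)) >= k
```

With a 3-of-4 setup, one damaged share is survivable; a second share with even one bad byte is not, although almost all of its content is still good.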

Ultimately, while zfec is way faster than par2, I found this setup squirrely
enough that I wasn't convinced to switch.

------
amelius
Slightly offtopic (although the title contains "portable"): it often saddens
me that purely mathematical functions like these are so tightly coupled to
implementation language (even though there are three different language
bindings here). Shouldn't the programming community have advanced past that
point by now?

~~~
jimktrains2
Can you give an example of what you mean?

~~~
JohnStrange
Maybe he has this in mind:

Imagine a high-level, straightforward imperative language X whose only purpose
is to specify libraries and data structures. This X has strong typing, memory
safety and maybe also memory alignment directives but abstracts away as much
as possible from any actual environment/platform/CPU layout. It has a precise
semantics and even some ways of specifying pragmatics (e.g. big O runtime
behavior of functions). In X you write "pure" algorithms that do not access
(m)any OS-dependent features. Libraries in this language are then transpiled
to various target languages and platforms.

To use _all_ libraries that have ever been written for X, someone only needs
to write a transpiler-backend for the target language/platform once.

LLVM intermediate language is similar but too low-level. Think about a high-
level counterpart.

------
tener
While FECs are useful, I really wish there was a free and good implementation
of fountain codes [1].

[1]:
[https://en.m.wikipedia.org/wiki/Fountain_code](https://en.m.wikipedia.org/wiki/Fountain_code)

~~~
nickcw
I believe fountain codes are patented so you are unlikely to see a free and
open implementation.

According to this comment:
[https://news.ycombinator.com/item?id=2684488](https://news.ycombinator.com/item?id=2684488)

Qualcomm now owns the patents.

~~~
tener
Are software patents a thing outside the US? Surely plenty of hackers in the EU
could use such a library.

~~~
nickcw
I believe that software can be patented in the EU if you have a clever patent
agent...

From wikipedia:
[https://en.wikipedia.org/wiki/Software_patents_under_the_Eur...](https://en.wikipedia.org/wiki/Software_patents_under_the_European_Patent_Convention)

> Under the EPC, and in particular its Article 52,[1] "programs for computers"
> are not regarded as inventions for the purpose of granting European
> patents,[2] but this exclusion from patentability only applies to the extent
> to which a European patent application or European patent relates to a
> computer program as such.[3] As a result of this partial exclusion, and
> despite the fact that the EPO subjects patent applications in this field to
> a much stricter scrutiny[4] when compared to their American counterpart,
> that does not mean that all inventions including some software are de jure
> not patentable.

Going back to the original question, I certainly wouldn't want to start an
open source project knowing that Qualcomm might sue me for patent
infringement. Defending their intellectual property is part of their core
business strategy.

------
progman
In the face of cheap TB disks, erasure coding seems to be a far better option
than RAID.

[http://www.computerweekly.com/feature/Erasure-coding-versus-...](http://www.computerweekly.com/feature/Erasure-coding-versus-RAID-as-a-data-protection-method)

Is there any empirical data on which zfec parameters are recommended for which
device? AFAIK reliability is DVD < Blu-ray < SSD < HD < tape < cloud.

------
speps
Never seen this licence before:
[https://github.com/tahoe-lafs/zfec/blob/master/COPYING.TGPPL...](https://github.com/tahoe-lafs/zfec/blob/master/COPYING.TGPPL.rst)

Anyone know more details?

~~~
woliveirajr
Seems to exist since 2010:
[https://thunk.org/tytso/blog/2010/01/20/the-transitive-grace...](https://thunk.org/tytso/blog/2010/01/20/the-transitive-grace-period-public-licence-good-ideas-come-around/)

Original version of the license:
[http://zooko.com/tgppl.html](http://zooko.com/tgppl.html)

Seems like the goal was to keep the product under copyright for some time,
giving it an opportunity to recover costs or be integrated exclusively inside a
product for a while, giving some advantage over competitors, but making it
publicly licensed after that period.

Also found this explanation:
[https://tahoe-lafs.org/~zooko/tgppl.pdf](https://tahoe-lafs.org/~zooko/tgppl.pdf)

------
nullc
Zfec is really slow compared to state of the art CRS codes like cm256:
[https://github.com/catid/cm256](https://github.com/catid/cm256)

------
moreati
Another plug for
[https://math.stackexchange.com/questions/663643/discover-par...](https://math.stackexchange.com/questions/663643/discover-parameters-of-a-reed-solomon-code-from-its-output/1290327#1290327):
given n inputs & outputs of a Reed-Solomon function, is it possible to derive
the parameters?

------
visarga
Is it like
[https://en.wikipedia.org/wiki/Tornado_code](https://en.wikipedia.org/wiki/Tornado_code)
?

They were used for reliable data distribution, being able to recover from loss
of packets without requesting the packets be sent again.

------
kristianov
Just curious: any similar libraries for the other class of erasure codes, the
fountain codes?

------
zypeh
Can anyone explain to me what an `erasure code` is used for?

~~~
notmyname
As a quick answer, the name comes from being able to recover data when some of
it is "erased".
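
A toy illustration of that idea, assuming the simplest possible code: k equal-length data blocks plus one XOR parity block. This is only a sketch and is not how zfec works internally; real tools use more general codes that survive more than one loss. All names here are made up for the example.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(blocks: List[bytes]) -> List[bytes]:
    """Return the data blocks followed by one XOR parity block."""
    parity = reduce(xor_bytes, blocks)
    return blocks + [parity]


def recover(shares: List[Optional[bytes]]) -> List[bytes]:
    """Rebuild the data blocks when at most one share is erased (None)."""
    missing = [i for i, s in enumerate(shares) if s is None]
    if len(missing) > 1:
        raise ValueError("a single parity block can only repair one erasure")
    repaired = list(shares)
    if missing:
        present = [s for s in shares if s is not None]
        # XOR of all surviving shares equals the erased one.
        repaired[missing[0]] = reduce(xor_bytes, present)
    return repaired[:-1]  # drop the parity block
```

Erase any one of the k+1 shares and `recover` still returns the original blocks; erase two and it raises, which is where stronger codes come in.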

The only way to durably store data so that it survives a hardware failure
(e.g. drive dying) is to store more than one copy. Full replicas are the
simplest way to do this, but you've got a relatively high overhead (e.g. Store
1GB of data with 3x replicas, and you store 3GB of data). Erasure codes are a
way to effectively store fractional replicas, so you only use 1.5x or 1.7x of
the original data.
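
The arithmetic behind those numbers, as a rough sketch. It assumes a k-of-n code in which any k of the n pieces are enough to rebuild the data and each piece is 1/k of the original size; the 10-of-15 and 10-of-17 parameters are hypothetical, picked only because they give 1.5x and 1.7x.

```python
def storage_overhead(k: int, n: int) -> float:
    """Bytes stored per byte of original data for a k-of-n scheme."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return n / k


def tolerated_losses(k: int, n: int) -> int:
    """How many of the n pieces can be lost while k still remain."""
    if not 0 < k <= n:
        raise ValueError("need 0 < k <= n")
    return n - k


# Hypothetical parameter choices; 3x replication is modelled as 1-of-3.
schemes = [("3x replicas", 1, 3), ("10-of-15", 10, 15), ("10-of-17", 10, 17)]
for name, k, n in schemes:
    print(f"{name}: {storage_overhead(k, n):.1f}x stored, "
          f"survives {tolerated_losses(k, n)} lost pieces")
```

Under these assumptions, 3x replication stores three times the data to survive two losses, while 10-of-15 stores half as much and survives five.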

Erasure codes are great when you've got a lot of data and you need high
durability but don't want to pay for the storage space required for full
replicas.

Why don't we always use erasure codes for everything? EC isn't great when
you've got small bits of data, and since there's a bit of math involved in
reading and writing the EC data, EC has higher latency than simple replicas.

[https://www.swiftstack.com/blog/2015/04/20/the-foundations-o...](https://www.swiftstack.com/blog/2015/04/20/the-foundations-of-erasure-codes/)
is a great intro to how erasure codes work.

------
est
The project behind it, Tahoe-LAFS, deserves some love.

Basically, spread your data across many cloud storage providers and build a
super fast soft-RAID.

~~~
tscs37
I tried Tahoe-LAFS, the setup seems a bit weird and from trying it on that
public gateway a bit, there seem to be some issues in relation to what happens
if you lose that link to your folder.

There seems to be no way of using it to simply attach and have a list of your
data only.

I'm favoring Camlistore over Tahoe, though Camli has no good encryption yet
and no erasure encoding. The basis of Camli seems to be better.

