
Duplicacy: Lock-free deduplication cloud backup tool, with “fair source” license - acrosync
https://github.com/gilbertchen/duplicacy
======
kefka
Reported to GitHub as commercial software masquerading as a free/open-licensed
project (MIT, GPL, BSD, etc.).

Also, intentional namespace pollution with an existing backup tool, which IS
GPL'ed.

Not cool. Not cool at all.

____________________________________

(response, since I'm submitting 'too fast'... ):

Github has commercial repos, and private repos.

It's pretty simple, really. If you want the free options on GH, you choose
from a list of standard Open Source licenses.
[https://github.com/blog/1964-open-source-license-usage-on-github-com](https://github.com/blog/1964-open-source-license-usage-on-github-com)

You're also asked to create a LICENSE file to go along with this.

Their license, however, is very much NON-FREE. As in: since I work for an
employer of 50k people, if I click clone, I'm in violation. Full stop. And
we're not even talking about developing on it, or submitting PRs, or what have
you. Simply making a copy puts me in violation.

It's very much against the spirit of GitHub, and probably against the license
on GH as well.

And it's also attempting to dilute another project that does something
similar. It just so happens they're 2 letters apart: Duplicacy vs Duplicity.
That's an asshole thing to do.

Here are a few names I just devised: ClouDuplicate, Clouder, DupliCloud, CfC
(cloud file cloud)...

Instead, it's very uncool to try to pollute the namespace of an existing tool
that does the same thing. We're talking pro-level bad will here.

~~~
vertex-four
Wait, how's it masquerading as what?

~~~
nerdponx
The name is very similar to the GPL software Duplicity, and the license is not
a free software license.

~~~
vertex-four
While I'd agree it's similar, it's not the same.

Public projects on GitHub do not need to be under an open source license -
there is absolutely no requirement for that anywhere.

~~~
nerdponx
Right, I was just explaining the "masquerading".

------
davexunit
Note that the "fair source" license is a proprietary software license that
happens to sound like a free software license.

------
bascule
Some claims:

"It is the only cloud backup tool that allows multiple computers to back up to
the same storage simultaneously without using any locks (thus readily amenable
to various cloud storage services)"

"What is novel about lock-free deduplication is the absence of a centralized
indexing database for tracking all existing chunks and for determining which
chunks are not needed any more. Instead, to check if a chunk has already been
uploaded before, one can just perform a file lookup via the file storage API
using the file name derived from the hash of the chunk."
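
To make the quoted scheme concrete, here is a minimal sketch of the idea (the
storage type and function names are hypothetical, not Duplicacy's actual API):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// memStorage stands in for any file-based backend (S3, SFTP, local disk...).
type memStorage map[string][]byte

func (m memStorage) Exists(name string) bool          { _, ok := m[name]; return ok }
func (m memStorage) Upload(name string, data []byte)  { m[name] = data }

// uploadChunk names the chunk after its hash and skips the upload when a
// file with that name already exists -- no central index, no locks needed.
func uploadChunk(s memStorage, chunk []byte) {
	sum := sha256.Sum256(chunk)
	name := "chunks/" + hex.EncodeToString(sum[:])
	if s.Exists(name) {
		return // already uploaded, possibly by another client
	}
	s.Upload(name, chunk)
}

func main() {
	s := memStorage{}
	uploadChunk(s, []byte("same data"))
	uploadChunk(s, []byte("same data")) // deduplicated: no second upload
	fmt.Println(len(s))                 // prints 1
}
```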

Tahoe-LAFS's immutable file model (based on convergent encryption) was capable
of doing this same thing a decade ago, and also features a pretty nifty
capability-based security model:

[https://tahoe-lafs.org/trac/tahoe-lafs](https://tahoe-lafs.org/trac/tahoe-
lafs)

~~~
acrosync
Naming chunks by their hashes is not a new idea, but this technique alone
does not give you a practical backup tool. The deletion of unreferenced chunks
becomes a hard problem, and the centerpiece of lock-free deduplication is the
two-step fossil collection algorithm that solves it.
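
Very roughly, and with invented names (the DESIGN.md linked below has the
actual rules), the two steps look like this:

```go
package fossil

// Storage stands in for any file-based backend; all names are invented.
type Storage interface {
	ListChunks() []string
	ListFossils() []string
	Rename(oldName, newName string)
	Delete(name string)
}

// Step 1: an unreferenced chunk is not deleted but renamed to a fossil,
// so a backup running concurrently can still find it and resurrect it.
func collectFossils(s Storage, referenced map[string]bool) {
	for _, name := range s.ListChunks() {
		if !referenced[name] {
			s.Rename(name, name+".fsl")
		}
	}
}

// Step 2: only after every client has completed a new backup that began
// after step 1 (and thus would have resurrected any fossil it still
// needed) is it safe to delete the remaining fossils for good.
func deleteFossils(s Storage) {
	for _, name := range s.ListFossils() {
		s.Delete(name)
	}
}
```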

~~~
bascule
Tahoe-LAFS supports a mark/sweep-style garbage collection algorithm
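
For reference, the mark/sweep idea over a chunk graph is simple enough to
sketch generically (illustrative only, not Tahoe-LAFS's actual code):

```go
package gc

// markSweep returns the chunk ids unreachable from any snapshot root.
// children maps an id to the ids it references; all lists every stored chunk.
func markSweep(roots []string, children func(string) []string, all []string) []string {
	marked := make(map[string]bool)
	var mark func(id string)
	mark = func(id string) {
		if marked[id] {
			return
		}
		marked[id] = true // mark phase: walk everything reachable
		for _, c := range children(id) {
			mark(c)
		}
	}
	for _, r := range roots {
		mark(r)
	}
	var garbage []string
	for _, id := range all {
		if !marked[id] { // sweep phase: unmarked chunks are garbage
			garbage = append(garbage, id)
		}
	}
	return garbage
}
```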

------
acrosync
Developer here. Duplicacy is built on the concept of Lock-Free Deduplication
([https://github.com/gilbertchen/duplicacy/blob/master/DESIGN.md](https://github.com/gilbertchen/duplicacy/blob/master/DESIGN.md)),
which allows it to back up multiple computers to the same storage without
using any locks. Currently it supports local or networked drives, SFTP
servers, Amazon S3, Backblaze B2, Microsoft Azure, Google Cloud Storage,
Google Drive, OneDrive, Dropbox, and Hubic.

I recently released the source code under the Fair Source 5 License
([https://fair.io/](https://fair.io/)), which means it is free for individuals
and for businesses with fewer than 5 users. Otherwise the license costs only
$20 per user/year.

Questions and suggestions are welcome.

~~~
koolba
From the link to the license:

> Fair Source has the power to _promote diversity within the developer
> community_. To date, contributing to open source has been an expensive
> proposition for developers. You have to have a stable income and a lot of
> extra time to work on side projects for free, which means talented
> developers from underprivileged backgrounds often aren’t able to contribute.
> Fair Source allows developers to monetize their side projects, which means
> more people can afford to join the ranks of developers who pursue these
> initiatives.

I find it funny that some people feel a need to justify charging money for
something by coming up with bogus social justice rationalizations.

~~~
actuallyalys
I agree that people don't _need_ a justification to charge money, but the
rationale isn't bogus—it is hard to contribute without a stable income,
underrepresented groups in tech tend to make less money in general, and
getting paid could help that.

I'm not sure this license is the way to go, though. Unusual licenses tend to
turn people off, and it's not clear how profits from this license would go to
contributors.

------
whyagaindavid
The name is too similar to Duplicity; do you mind renaming?

~~~
acrosync
I didn't intentionally set out to sound like Duplicity, but duplicacy.com was
still available at the time and I thought it was a perfect name for a backup
tool...

~~~
CogitoCogito
For what it's worth, I thought this was Duplicity until I read this comment,
even after a quick glance at the GitHub repo. Since you're both in backup,
this is going to be very confusing to people.

~~~
btschaegg
Same here.

Edit: Come to think of it, it's quite funny how many backup solutions are
named along these lines when, under the hood, they actually go to great
lengths to get rid of duplicates. Maybe a name derived from "condensing" or
"shelving" would be more accurate? ;-)

------
zekevermillion
I think I understand the goals of the "fair source" license, but why not make
it copyleft all around, and just sell a hosted version to small businesses and
license exceptions to corporate clients?

~~~
wmf
Nobody needs the exception so they won't buy it. That business model only
works for copyleft libraries.

~~~
gant
Depends; if it's something likely to be customized, AGPL might work.

------
dom0
The verdict on the "open source competition" in Duplicacy's README is not
entirely accurate. Exclusive locking in the synchronized approach is just the
easiest implementation, not the only possibility. I can't speak for other
tools, since I do not know their internals well enough, but I can say about
Borg ([http://www.borgbackup.org/](http://www.borgbackup.org/)) that there is
no _inherent_ issue in running the important parts of making backups (i.e.
uploading and deduplicating data) in parallel. It's just not implemented.

Cloud storage back-ends are a somewhat similar story. Supporting them wouldn't
be that complex, although locking is a problem due to the eventual-consistency
(EC) model of most of these services. Plans to enable this have existed for
quite some time now; there's just been no time to implement them, and other
features are requested more frequently.

~~~
acrosync
I might be wrong, so I'd like to hear more from you if you're a Borg
developer. My understanding is that you may be able to have multiple clients
uploading chunks at the same time, but you won't be able to exploit
cross-client deduplication if different clients have a similar set of files
(OS files or a large code base, for instance). Moreover, if your
implementation requires locks, then it would be very hard to extend to cloud
services.

~~~
dom0
Yes, that's right: _concurrent_ addition of the same chunks would generally
mean that some work is wasted, so concurrent long-running jobs would not
synchronize well in this model, and lock-free clearly performs better there.

The only operation which inherently has to be guarded by a lock in Borg is
inserting the archive pointer into the manifest (the root object; see
[https://borgbackup.readthedocs.io/en/latest/internals/data-structures.html#the-object-graph](https://borgbackup.readthedocs.io/en/latest/internals/data-structures.html#the-object-graph)).
I suppose it would be possible to work around that without locking, or to use
the usual hacks around EC: put, then get and check, retrying the put as
needed, until it's "probably there".
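
A hypothetical sketch of that put-then-verify hack (invented names, not Borg
code):

```go
package ecstore

import (
	"bytes"
	"errors"
	"time"
)

// Store stands in for an eventually consistent object store.
type Store interface {
	Put(name string, data []byte) error
	Get(name string) ([]byte, error)
}

// putChecked writes an object, then polls until a read returns the same
// bytes, re-issuing the put if the object never becomes visible.
func putChecked(s Store, name string, data []byte) error {
	const maxPuts, pollsPerPut = 3, 10
	for attempt := 0; attempt < maxPuts; attempt++ {
		if err := s.Put(name, data); err != nil {
			return err
		}
		for i := 0; i < pollsPerPut; i++ {
			if got, err := s.Get(name); err == nil && bytes.Equal(got, data) {
				return nil // "probably there"
			}
			time.Sleep(500 * time.Millisecond)
		}
	}
	return errors.New("object never became visible")
}
```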

Deleting / pruning archives would still require a full lock due to the same
conceptual issues that your two-phase GC avoids. The same goes for "check".

------
Mister_Snuggles
For about the last year or so I've been looking for an online backup system
with the following requirements:

- Off-site storage, preferably not costing too much.

- Option for on-site storage (e.g., to store a backup "in the cloud" and on my
NAS).

- Keeps version history, with the associated goodies (purging old backups,
etc.).

- Able to run on FreeBSD and Linux, with Windows and macOS being nice to have
but not required.

- Able to back up multiple machines to one account.

I strongly suspect that my solution will involve two separate things - one to
actually do the backups and another for the storage.

So far, not having looked at Duplicacy, I'm leaning strongly towards
attic/borg with rsync.net for off-site storage. At first glance, Duplicacy
looks like it will meet my requirements so I will have to give it a closer
look before I pick a solution.

~~~
StavrosK
You just need Borg. Here's a post I wrote about it (as you say, Borg and
rsync.net):

[https://www.stavros.io/posts/holy-grail-backups/](https://www.stavros.io/posts/holy-grail-backups/)

I posted it in the hope of helping a few people who want to do backups:
[https://news.ycombinator.com/item?id=14507656](https://news.ycombinator.com/item?id=14507656)

~~~
Mister_Snuggles
I currently use Attic for backups going onto my NAS, so one plus for
attic/borg is familiarity. I figure that if I'm going to go with rsync.net,
I'll switch to Borg, since it's (as you point out) better maintained.

Are you using rsync.net's "hidden" attic/borg option? This makes the price
very attractive.

You mention using "attic check" to guard against bitrot on the provider's
storage. How is this in terms of bandwidth used? Does it have to transfer
every byte or does it compute a checksum on the encrypted data (since
rsync.net doesn't have the raw data) and just send that?

~~~
StavrosK
> Are you using rsync.net's "hidden" attic/borg option? This makes the price
> very attractive.

I am, yes, and it is quite attractive.

> You mention using "attic check" to guard against bitrot on the provider's
> storage. How is this in terms of bandwidth used? Does it have to transfer
> every byte or does it compute a checksum on the encrypted data (since
> rsync.net doesn't have the raw data) and just send that?

It's very bandwidth-efficient, but I have stopped doing that every day, as
rsync.net told me they use ZFS and scrub their arrays regularly, so they would
discover bit rot early. I only run the check once a month now.

------
someonewhois
How is this related to the other Duplicity backup software?

[http://duplicity.nongnu.org](http://duplicity.nongnu.org)

~~~
dom0
Not related at all. (Duplicacy != Duplicity)

Duplicity is a pretty straightforward, good old-fashioned incremental backup
program.

Duplicacy, on the other hand, does hash-based deduplication (BorgBackup /
Attic, Restic, etc. are some others).

The design of Duplicacy is slightly different from that of e.g. BorgBackup.
Duplicacy, as the title says, uses a _lock-free_ approach. BorgBackup and the
handful of open source tools in the same spirit use a synchronized approach.

------
bmaranville
I had been using rclone ([https://rclone.org/](https://rclone.org/)) for
Amazon S3, which has some of the same features, but recently the application
key was blocked by Amazon. Is Duplicacy safe from the same fate?

~~~
acrosync
I think Amazon only blocked rclone's application key for Amazon Drive. There
is no way for Amazon to prevent a third-party application from accessing S3,
since users provide their own S3 credentials and Amazon doesn't know who is on
the other side.

------
voiper1
Can someone explain how it's able to make small updates, e.g. to S3? How does
it know what's already there -- a cache? How does it prune old chunks -- will
there be tons of individual API requests to S3?

~~~
acrosync
We use a pack-and-split approach -- files are packed first (as if building a
big tar file, although this is only conceptual) and then split into chunks
using a variable-sized chunking algorithm. You can customize the chunk size,
but by default the average chunk size is 4MB, so you won't be uploading too
many small files.
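
As a toy illustration of variable-sized (content-defined) chunking -- not
Duplicacy's actual implementation, and with constants scaled down so the demo
runs on a small buffer:

```go
package main

import "fmt"

const (
	window = 48            // bytes in the rolling-hash window
	prime  = 31            // multiplier for the polynomial rolling hash
	mask   = (1 << 12) - 1 // ~4 KiB average here; Duplicacy defaults to ~4 MB
)

// chunk cuts data wherever the rolling hash of the last `window` bytes
// matches the mask, so cut points depend on content, not on offsets:
// identical runs of bytes produce identical chunks even after data is
// inserted or removed earlier in the stream.
func chunk(data []byte) [][]byte {
	var pow uint32 = 1
	for i := 0; i < window; i++ {
		pow *= prime
	}
	var chunks [][]byte
	var hash uint32
	start := 0
	for i, b := range data {
		hash = hash*prime + uint32(b)
		if i >= window {
			hash -= pow * uint32(data[i-window]) // slide the window forward
		}
		// enforce a minimum chunk size of one window before cutting
		if i-start+1 >= window && hash&mask == mask {
			chunks = append(chunks, data[start:i+1])
			start = i + 1
		}
	}
	if start < len(data) {
		chunks = append(chunks, data[start:])
	}
	return chunks
}

func main() {
	// Pseudo-random test data from a simple linear congruential generator.
	data := make([]byte, 1<<16)
	var x uint32 = 1
	for i := range data {
		x = x*1664525 + 1013904223
		data[i] = byte(x >> 24)
	}
	fmt.Println("chunks:", len(chunk(data)))
}
```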

------
kierenj
I played with it for 20 minutes and really struggled with the GUI/UX. I've
raised some issues (I really wanted to like it), but it just feels really
clunky for something asking for cash.

------
leni536
Its encryption scheme and threat model seem to be similar to CryFS's [1].

[1] [https://www.cryfs.org/](https://www.cryfs.org/)

------
bebopfunk
What is the encryption standard being used for file encryption?

------
X86BSD
The website makes it sound just like Tarsnap. Am I wrong? Is there some
compelling feature I am missing that would make me want to switch from
Tarsnap?

~~~
wmf
AFAIK Tarsnap backups go through Tarsnap's server, but no other backup tool
here seems to have that requirement.

~~~
j_s
Does this mean Tarsnap de-dupes before encrypting? That doesn't seem to make
sense but I don't see any other reason going through their server would be
required.

~~~
Scaevolus
Tarsnap uses content-based hashing too. The pipeline is basically: tar |
chunk | encrypt | upload-new-chunks

The Tarsnap server provides a transactional KV store: "In order to create a
new archive, the tarsnap client sends a "write transaction start" request,
many "write data" requests, and a "commit transaction" request to the tarsnap
server; deleting an archive is similar (except with a "delete transaction
start" and "delete data" requests)."
[http://www.daemonology.net/blog/2008-12-14-how-tarsnap-uses-aws.html](http://www.daemonology.net/blog/2008-12-14-how-tarsnap-uses-aws.html)
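
A hypothetical client-side sketch of that transaction flow (invented names,
not Tarsnap's real API); the point is that an archive only becomes visible
once the commit succeeds, so an interrupted upload leaves no partial archive:

```go
package tarsnapish

// KV stands in for a transactional key-value store like the one the
// quoted post describes. All names here are invented for illustration.
type KV interface {
	WriteTransactionStart(archive string) (txID string, err error)
	WriteData(txID string, chunk []byte) error
	CommitTransaction(txID string) error
}

// createArchive streams chunks inside a single write transaction; if
// the client crashes before CommitTransaction, the server can simply
// discard the uncommitted data.
func createArchive(kv KV, name string, chunks [][]byte) error {
	tx, err := kv.WriteTransactionStart(name)
	if err != nil {
		return err
	}
	for _, c := range chunks {
		if err := kv.WriteData(tx, c); err != nil {
			return err // nothing committed yet
		}
	}
	return kv.CommitTransaction(tx)
}
```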

