

Why SpiderOak doesn't de-duplicate data across users - rarrrrrr
https://spideroak.com/blog/20100827150530-why-spideroak-doesnt-de-duplicate-data-across-users-and-why-it-should-worry-you-if-we-did

======
count
While that may be why SpiderOak does what they do, I disagree with most of
their arguments.

First, you can do variable block-based de-duplication, which is how major
storage vendors do it - not per-file, which doesn't really buy you much.
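
Roughly, the idea is content-defined chunking: a rolling fingerprint over the
data picks the block boundaries, so an insertion near the start of a file only
disturbs the chunks around it instead of shifting every block. A minimal
sketch in Python (the table, mask, and size limits are made-up illustrative
values, not any vendor's actual parameters):

    import hashlib
    import random

    _rng = random.Random(1)
    GEAR = [_rng.getrandbits(32) for _ in range(256)]  # per-byte random constants
    MASK = 0x1FFF                # boundary when 13 low bits are zero (~8 KiB average)
    MIN_CHUNK = 2 * 1024
    MAX_CHUNK = 64 * 1024

    def chunks(data: bytes):
        """Yield variable-size chunks whose boundaries depend on content."""
        start, fp = 0, 0
        for i, b in enumerate(data):
            fp = ((fp << 1) + GEAR[b]) & 0xFFFFFFFF    # rolling fingerprint
            length = i - start + 1
            if (length >= MIN_CHUNK and (fp & MASK) == 0) or length >= MAX_CHUNK:
                yield data[start:i + 1]
                start, fp = i + 1, 0
        if start < len(data):
            yield data[start:]

    def chunk_digests(data: bytes):
        # a de-duplicating store keys blocks on these digests, so identical
        # runs of content hash identically wherever they appear
        return [hashlib.sha256(c).hexdigest() for c in chunks(data)]

Two files that share most of their content end up sharing most of their chunk
digests, which is what makes block-level dedupe pay off where whole-file
dedupe doesn't.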

Leveraging this in the SAN firmware also prevents their ridiculous file
transfer 'vulnerability' (which only exists due to the way they wrote their
software) - all of your files copy over the network to the storage system.
Once they're on the storage system, at some later time and asynchronously, the
storage system runs a dedupe on the blocks, and winnows down its storage.
Think of it like transparent compression on the storage side - only,
hopefully, less I/O-intensive.
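
Something like this, conceptually - a background pass over blocks that are
already on disk, folding identical ones into a single retained copy (names and
structures made up for illustration):

    import hashlib

    def dedupe_pass(blocks):
        """blocks: bytes objects already written to the store.
        Returns one retained copy per digest plus a block->digest map;
        a real system would also track reference counts before reclaiming
        the duplicate copies."""
        unique_store = {}      # digest -> single retained copy
        references = {}        # original block index -> digest
        for idx, block in enumerate(blocks):
            digest = hashlib.sha256(block).hexdigest()
            unique_store.setdefault(digest, block)
            references[idx] = digest
        return unique_store, references

    # three blocks, two identical -> only two copies survive the pass
    store, refs = dedupe_pass([b"a" * 4096, b"b" * 4096, b"a" * 4096])
    assert len(store) == 2

The upload looks identical whether or not the data already exists on the back
end, which is the point of doing the winnowing after the fact.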

Finally, they could simply encrypt everything and then they can't answer
subpoenas about who has knowledge of what - it was all encrypted immediately
after uploading, and no log is kept. If the data doesn't exist, you can't be
forced to give it up...

~~~
kijinbear
IIRC, SpiderOak doesn't encrypt your files "immediately after uploading".
Files are encrypted on the client side even before they're uploaded. Therefore
SpiderOak never even sees unencrypted data. That's their "zero knowledge"
policy. They think it's better to make it impossible for themselves ever to
have any idea what files they're hosting, rather than saying "We used to know
X seconds ago, but we don't know anymore."

The blog doesn't mention whether or not SpiderOak uses block-level
deduplication. Maybe it's part of their storage infrastructure, maybe it
isn't.
But all that client-side encryption would severely reduce the number of
duplicate blocks even if everyone uploaded the same file.
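
A toy way to see why (not SpiderOak's actual scheme - just symmetric
encryption under two different client-held keys, using the third-party
cryptography package):

    import hashlib
    from cryptography.fernet import Fernet

    plaintext = b"the same installer bytes for everyone"

    alice = Fernet(Fernet.generate_key())
    bob = Fernet(Fernet.generate_key())

    print(hashlib.sha256(alice.encrypt(plaintext)).hexdigest())
    print(hashlib.sha256(bob.encrypt(plaintext)).hexdigest())   # different digest

Identical plaintext, different ciphertext, different block digests - so a
server that only ever sees ciphertext has nothing to match across users.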

------
houseabsolute
> In a large enough population of data, collisions happen.

So you pick a hash function whose space is so large that the risk of collision
is less than the risk of any other possible reason for accidental mis-
identification (like all the file's bytes spontaneously switching to the
collided file's bytes). You can have databases with trillions of objects with
less than a one in a million chance of collision in a 256-bit space.

Realistically, there are a lot of things that are better to worry about than a
one-in-a-million chance of losing a file. And if you really need so many
objects that that's not enough, just increase the size of the hash space.
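
The back-of-the-envelope version: for n random b-bit digests, the chance of
any collision is roughly n^2 / 2^(b+1), so the "one in a million" figure above
is actually wildly conservative for a 256-bit space:

    def collision_probability(n_objects: int, bits: int) -> float:
        # birthday approximation: n^2 / 2^(bits + 1)
        return n_objects ** 2 / 2 ** (bits + 1)

    print(collision_probability(10 ** 12, 256))   # ~4e-54 for a trillion objects
    print(collision_probability(10 ** 12, 128))   # ~1e-15 even at 128 bits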

~~~
rarrrrrr
Yes; if you increase the hash space sufficiently, these problems go away. I
don't think wide hashing has really become a standard industry practice
though, because services want to pick the option that is least burdensome to
end users' CPUs. Another issue is that once they have a big de-duplication
database established based on a particular hash, switching is expensive. I
suspect a lot of shops are still using MD5.

~~~
houseabsolute
Perhaps. That said, even a SHA-512 sum on my computer seems to take a little
less than a CPU-second per hundred megabytes. Odds are you're not going to be
uploading that fast, so you should be able to do that work "online" without a
noticeable impact on upload throughput or the user experience. This is doubly
true on a multi-core device, since the SHA sum I quoted was single-threaded. I
would think the more important thing to minimize is user-impacting disk
latency from the backup scan.
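
Easy enough to sanity-check on your own machine (numbers will obviously vary
with the CPU):

    import hashlib
    import time

    payload = b"\x00" * (100 * 1024 * 1024)   # 100 MB of zeros

    start = time.perf_counter()
    hashlib.sha512(payload).hexdigest()
    elapsed = time.perf_counter() - start
    print(f"SHA-512 over 100 MB: {elapsed:.2f}s "
          f"({100 / elapsed:.0f} MB/s, one thread)")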

------
klodolph
"In a large enough population of data, collisions [of cryptographic hash
values] happen."

True, but this does not happen in practice. Using SHA-1 as an example hash
function, you'd need about 10^24 different files before you would expect a 50%
chance of a collision. You are not going to come anywhere near this limit;
10^24 different files have not been created during the entire span of human
history.
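
For the record, that 10^24 comes straight from the birthday bound - the number
of digests needed before a random collision becomes a coin flip is roughly
sqrt(2 * ln 2 * 2^bits):

    import math

    bits = 160   # SHA-1 output size
    print(f"{math.sqrt(2 * math.log(2) * 2 ** bits):.2e}")   # ~1.4e24 files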

~~~
rarrrrrr
... and yet SHA-1 collisions have happened. According to NIST, "Federal
agencies should stop using SHA-1 for...applications that require collision
resistance as soon as practical".

Regrettably, we don't actually have any truly ideal hash functions yet.

~~~
tptacek
Not that any of the rest of this argument makes much sense, but SpiderOak
(says it) uses SHA256. There are no induced collisions on the horizon for
SHA256, and there won't be accidental ones in this data set.

------
vimalg2
Interesting... Dropbox apparently does this.

This explains why adding the Ubuntu netbook ISO image to the Dropbox folder
synced in milliseconds.

My friend was going a little nuts wondering if Dropbox had choked.
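
Presumably the client hashes the blocks locally and asks the server which
digests it already has, uploading nothing it recognizes - a guess at the
mechanism, with server_has() and upload() standing in for whatever the real
protocol does:

    import hashlib

    BLOCK_SIZE = 4 * 1024 * 1024   # illustrative 4 MB blocks

    def sync_file(path, server_has, upload):
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                digest = hashlib.sha256(block).hexdigest()
                if not server_has(digest):   # someone already uploaded this block
                    upload(digest, block)

If every block of the ISO is already on their servers, the "upload" is just a
handful of digest lookups, hence the milliseconds.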

------
naner
If you've got a lot of pirated media, use SpiderOak; otherwise use Dropbox.
Got it.

~~~
Axelnm
If you leave comments just to troll, do it often and stupidly. Got it!
Disappointed over missing dropbox.com referral link :-(

~~~
naner
I don't actually have a Dropbox account; I was just being an ass. Probably
shouldn't have posted that to begin with. I was in reddit mode.

