
Finally Bitcasa CEO Explains How The Encryption Works - eljaco
http://techcrunch.com/2011/09/18/bitcasa-explains-encryption/
======
callahad
Something is twitching in the back of my mind about this. Sure, they can't
look at the data based solely on the encrypted copy, but if they have a
plaintext copy of a document of interest, they are able to determine which of
their customers has that document, right?

Doesn't that diminish some of the privacy claims?

~~~
JoachimSchipper
Sure, but known-plaintext attacks are not the worst part. Consider this [found
via <http://www.mail-archive.com/cryptography@metzdowd.com/msg08982.html>]: I
take the standard WordPress config.php [for your host], fill in your site and
account name, fill in the one million most common database passwords, and ask
the cloud provider whether any of these hashes exist.

Or: I create a form (say .doc) with a single field, CC#, and hope people store
this. I then check the existence of 10^11 hashes to find (all customers'!)
credit card numbers (for a specific issuer). This takes only a CPU-day! (The
network is obviously slower.)
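
The attack above can be sketched in a few lines. This is a toy model, not
Bitcasa's actual protocol: `convergent_tag` stands in for "hash of the
deterministic ciphertext", and the config template, password list, and server
index are all hypothetical.

```python
import hashlib

def convergent_tag(data: bytes) -> bytes:
    # Stand-in for the dedup tag: in convergent encryption the ciphertext
    # is a deterministic function of the plaintext, so double-hashing the
    # plaintext models "hash of the ciphertext" well enough for this sketch.
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

# The provider's dedup index, as seen by an attacker who can query it.
# One victim stored a config file containing the DB password "hunter2".
victim_file = b"<?php $db_password = 'hunter2'; ?>"
server_index = {convergent_tag(victim_file)}

# Attacker fills the known template with each candidate password and asks
# the server whether that tag already exists.
candidates = ["password", "123456", "hunter2", "letmein"]
for pw in candidates:
    guess = ("<?php $db_password = '%s'; ?>" % pw).encode()
    if convergent_tag(guess) in server_index:
        print("victim's password is:", pw)
```

No decryption happens anywhere; the existence check alone leaks the secret.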

~~~
Locke1689
Eh. OK let's say instead of just SHA-256'ing the plaintext data to derive a
key you do 50,000 bcrypt rounds. Then the client encrypts the plaintext,
hashes the ciphertext, and sends the hash to the server. If it takes 0.5 s to
generate a single bcrypt key, it would take about 1,500 years to find a single
credit card number.
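
A sketch of that mitigation, using PBKDF2 from the Python stdlib as a
stand-in for the bcrypt rounds proposed above; the function name and
iteration count are illustrative.

```python
import hashlib

def slow_convergent_key(data: bytes, iterations: int = 200_000) -> bytes:
    # Stand-in for "50,000 bcrypt rounds": PBKDF2-HMAC-SHA256, with the
    # file's own hash as both password and salt, so the derived key is
    # still deterministic (dedup keeps working) but expensive to compute.
    h = hashlib.sha256(data).digest()
    return hashlib.pbkdf2_hmac("sha256", h, h, iterations)

# Same file always yields the same key, so dedup is preserved...
assert slow_convergent_key(b"same file") == slow_convergent_key(b"same file")

# ...but an attacker checking 10^11 candidate credit-card files at,
# say, 0.5 s per derivation needs on the order of 1,500 CPU-years:
print(10**11 * 0.5 / (3600 * 24 * 365))  # seconds -> years
```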

~~~
JoachimSchipper
Sure, but then it takes 0.5s per file to check whether the server and client
are in sync, too.

------
maaku
TL;DR: AES_key = SHA-256(file)

This does introduce new avenues for attacks, however. You don't have to be
able to decrypt to show that certain people have certain files.

Also, for files that contain just one piece of sensitive information while the
rest is predictable (e.g., the secret key file for a website back-end), you've
effectively given up a hash of the secret, which can then be brute-forced.
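
The TL;DR scheme can be sketched end to end. One big assumption: SHA-256 in
counter mode stands in for AES-256 here, purely so the sketch is
stdlib-runnable; the point being illustrated is only that identical
plaintexts yield identical ciphertexts, which is what makes dedup possible.

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    # SHA-256 in counter mode as a stdlib stand-in for AES-CTR.
    out, ctr = b"", 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(8, "big")).digest()
        ctr += 1
    return out[:n]

def convergent_encrypt(plaintext: bytes):
    key = hashlib.sha256(plaintext).digest()   # AES_key = SHA-256(file)
    ct = bytes(a ^ b for a, b in zip(plaintext, keystream(key, len(plaintext))))
    return key, ct

def convergent_decrypt(key: bytes, ct: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(ct, keystream(key, len(ct))))

key, ct = convergent_encrypt(b"shared file contents")
assert convergent_decrypt(key, ct) == b"shared file contents"
# Identical plaintexts produce identical ciphertexts -> server can dedup:
assert ct == convergent_encrypt(b"shared file contents")[1]
```

Determinism is exactly the trade-off: it enables dedup and enables the
known-file attacks discussed in this thread.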

~~~
SODaniel
If you can compare users' data without decrypting it with a user-defined key,
it is NOT zero-knowledge. Plain and simple.

------
xtacy
This thread has a lot of discussion related to "convergent encryption."

<http://news.ycombinator.com/item?id=2570538>

EDIT: <http://news.ycombinator.com/item?id=2461713> as well

EDIT2: Actually, there's more to this problem than just convergent encryption.
If the storage provider knows which encrypted blobs belong to you, it can
encrypt _some_ file and still figure out which users have copies of it. So,
the storage provider, which stores a collection of encrypted blobs, should not
know the blob -> list(users) association. I don't know if Bitcasa addresses
this part.

------
nikcub
My biggest issue (besides the initial TC article being a complete shocker) was
the claim of a 60% saving from de-duplication and that each user only had 25GB
of unique data.

This research paper from Microsoft on Farsite[1] claims 'up to 50%' savings on
de-dupe with a convergent file system - but that was tested against 500
computers in a corporate environment, and it was done back in 2002.

Users now store a lot more photos, a lot more of their own video, and any
content that is DRM'd is also unique. You can save on operating system and
application files, but it isn't 60%.

There is nothing 'finally' about this additional information. The discussion
and criticism of the claims on Twitter already assumed this information about
convergent encryption and the key being derived from the content. There is a
lot more that is still unanswered - such as how an 'intelligent cache' allows
'unlimited' storage to be available offline.

I really wish these guys would release a research paper with their results, or
include more information on their website before they make such bold claims in
public.

[1] <http://research.microsoft.com/apps/pubs/default.aspx?id=69954>

~~~
trotsky
_any content that is DRM'd is also unique._

In most cases this isn't true. The computation involved in keying media on the
fly while it's being downloaded is not insignificant when considered in
volume. The added pain of storing everyone's unique keys also discourages this
behavior. At worst you'll see different keys being used by region or
datacenter, or perhaps key rotation on a weeks-months scale.

Some media (both DRM and non-DRM) will be trivially unique because of metadata
like purchaser info or music tags. In some cases this makes the first block
unique but all later blocks are deduped, in other cases you need to be
somewhat content aware so you can treat the header data separate from the real
media data. This also allows you to catch a lot of data people ripped
themselves using standard settings.

 _You can save on operating system and application files, but it isn't 60%._

While I agree that photos and videos will be the bulk of their problem, I
don't think that ruins their premise. The question is if their userbase will
be significantly overweight on heavy media creators. If it's a standard
distribution, I wouldn't be surprised if a majority of people were under 10gb
unique and 70%+ deduped.

~~~
nikcub
I tested it on iTunes: I bought the same television episode on two different
accounts on two different laptops and compared the files, to find that they
had 4-5 bytes in common.

Not sure how it works in WMP. It would be common in blu-ray rips, though.

------
andrewcooke
it's important to note that this is not strong against knowledge of the
plaintext. that's kind-of obvious, when you think about how it supports de-
duplication, but perhaps an example will clarify why you might be concerned.

say you want to backup some data. and that data includes music or video... and
the riaa or mpaa decide that bitcasa are facilitating pirating and should be
shut down... so they reach a deal where all the data are checked against known
songs or videos. and if they find a match then your identity will be provided
for prosecution...

of course, if you are doing nothing wrong, you have nothing to fear. this can
only identify known data. but even so, it is an interesting issue:
"encryption" here doesn't have all the guarantees you might expect.

(there are more disturbing scenarios too. for example, perhaps a certain text
is not illegal in the copyright sense, but is unacceptable politically.)

[disclaimer - this is from skimming the paper; i should say that i am no
expert on this, so don't take my word as gospel]

------
nextparadigms
_"HP: What do you do in terms of encryption or security?

TG: We encrypt everything on the client side. We use AES-256 hash, SHA-256
hashing for all the data.

HP: So it’s encrypted all on the client side and you can’t look at it on the
server side?

TG: Exactly"_

Finally, a company that gets it. I've been asking for this for a while now. I
wish Dropbox and all the others would do this, too. I get that some of
Dropbox's customers may not want to deal with encryption on the client side,
but they should at least offer the option to everyone, and it should be
right there every time someone wants to upload something. It would be best if
it was the default option, too.

This way they won't get into the mess they got into last time with the feds
asking for user data, and the clients who want full security of their data
won't have to be worried about it anymore.

~~~
rednaught
In addition to Wuala, Spideroak does this as well.

A problem remains "with full security" in that you have no idea what's going
on in the binary client program. Reveal or open-source the client program and
allow customers who need this end-to-end security to compile the program
themselves.

~~~
SODaniel
We at SpiderOak in fact do not cross-account deduplicate AT ALL and provide a
full zero-knowledge environment with no access to client side encryption key
info.

We feel that the possible cost savings involved with deduplicating data across
user accounts is just not worth the inherent security risks.

------
lisper
Academic paper on convergent encryption:

<http://www.ssrc.ucsc.edu/Papers/storer-storagess08.pdf>

TL;DR version: take a chunk of data, encrypt it with its own sha1 hash as the
key. Now you have an encrypted version that you can dedup. You can only
decrypt if you already know the hash. Info about who owns any particular chunk
is not kept on the server, so even if you break in to the server, all you can
tell is which chunks correspond to data you already possess. Seems plausible.

~~~
sp332
The list of "who owns which hashes" must be stored on their servers, even if
it's not the "same" server. Otherwise I would have to manually transfer my
hashes from one computer to another.

~~~
lisper
Well, OK, but that data can also be convergently encrypted, so you only have
to transfer the hash, not the whole list. But your point is well taken. If you
can get your data from a different machine with nothing but a user name and
password, that's probably a security hole.

------
gst
Nothing new here. Same technique has been used by Wuala for years now.

~~~
morsch
Huh, I was going to cry foul, but apparently you're right and that is exactly
what they are doing. For reference see e.g.
<http://wualablog.blogspot.com/2011/04/wualas-encryption-for-dummies.html>

------
mmaunder
I would argue that you can either have data de-duping or encryption, but not
both.

If encryption is defined as: Transforming data so that only people with
special knowledge can read it.

Then if you can compare a chunk of encrypted data against another chunk to
determine the source data...

Well now you have very weak encryption because you could brute force it if you
have a large enough repository of user files.

~~~
maaku
While technically correct, that's not a practical observation. A memory bank
storing your “large enough repository of user files” would consume the entire
universe.

That said, people don't store random bitstrings. People store music on these
shared storage services--if I were a big media company I could find all the
MP3s of songs I own floating around P2P networks, compute their encrypted
forms, and subpoena the storage company for the user accounts storing any one
of those files. People have also been known to synchronize application data,
including files with secret keys or passwords, which in this case effectively
shares a hash of the password. That's better than Dropbox, but still, if the
key plus the normal file variation doesn't have enough entropy, an attacker
could brute-force the contents of the file.

EDIT: Those are just potential real-world attacks I can think of on the spot;
I'm sure there are plenty of others. While this is certainly (marginally)
better than Dropbox, real security and data de-duplication are mutually
exclusive.

------
rubyorchard
Encryption provides confidentiality in a secure system. Convergent encryption
doesn't fit that bill.

------
joshu
Why is dedupe so important?

I have to imagine this mostly helps with OS files that are standard across
many machines. Can't we ship a list of hashes client-side?

~~~
SODaniel
Cross-account deduplication has nothing to do with making improvements on the
client side; it's all about keeping storage costs down for the provider.

The only type of deduplication that matters to consumers is account-specific
deduplication, and that saves no data for the provider unless you charge for
non-deduplicated storage.

~~~
humbledrone
Cross-account deduplication does have a client-side benefit: duplicate files
don't need to be uploaded. E.g. your 300 GB iTunes library might sync to the
server in a couple of minutes, rather than days.
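
That saving can be sketched as a hypothetical hash-first upload protocol; the
32-byte tag, the function names, and the flow are illustrative, not Bitcasa's
actual wire format.

```python
import hashlib

server_store = {}  # tag -> encrypted blob, shared across all accounts

def upload(blob: bytes) -> int:
    """Upload a blob, returning the number of bytes actually sent."""
    tag = hashlib.sha256(blob).digest()
    if tag in server_store:    # server already has this content:
        return len(tag)        # only the 32-byte tag crosses the wire
    server_store[tag] = blob
    return len(tag) + len(blob)

song = b"x" * 5_000_000        # a 5 MB track that two users both own
first = upload(song)           # first user pays the full upload
second = upload(song)          # second user sends 32 bytes, not 5 MB
print(first, second)
```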

------
esutton
Basically the argument is that this is an encryption algorithm that is
deterministic, as there is no randomness after the initial value. This sounds
more like a random oracle, <http://en.wikipedia.org/wiki/Random_oracle>,
which, by the way, doesn't exist.

------
grimtrigger
So can Mark Zuckerberg sign up for Bitcasa and store all of Facebook's photos
there for $10 a month?

------
eljaco
Curious to hear if anyone has experience with this "convergent encryption."

~~~
cHalgan
Microsoft did something here:
<http://research.microsoft.com/en-us/projects/farsite/>

------
naner
So I can encrypt a file, upload it, and if someone else encrypts the exact
same file... they can decrypt my uploaded file? I'm having a hard time
wrapping my head around this.

~~~
charlesju
I believe that the encryption produces the same ciphertext every time. So, if
someone else has already uploaded the same file, they just point your file at
that other user's identical encrypted copy.

