
How Dropbox sacrifices user privacy for cost savings  - schwanksta
http://paranoia.dubfire.net/2011/04/how-dropbox-sacrifices-user-privacy-for.html
======
arashf
Hi all, Arash from Dropbox here. We understand the concern that the government
could try to guess whether a particular file has been uploaded to Dropbox
based on processing times and then request that Dropbox identify a user who
has access to that file. However, to seek user content information, the
government needs to comply with the provisions of the Electronic
Communications Privacy Act by obtaining a warrant supported by probable cause
(or in some cases a court order from a judge). Those safeguards protect user
privacy. De-duplication does not make users any more vulnerable to intrusive
government actions. Today, a government agency could ask any online service to
provide the names of all users who have a particular file, whether or not the
service employs de-duplication. And in that case, the government would also
need to support its request with a warrant or court order. The rules that
provide a check against unwarranted government snooping apply to online
services equally, regardless of their back-end architecture.

~~~
sigil
If you don't mind my asking, what is the percentage savings achieved by
de-duplication across all of Dropbox? Some others here have wondered if it was
premature optimization.

~~~
arashf
It wasn't a premature optimization. It was both a better experience for the
user (saves bw/reuploads for the user) and was simpler to implement (can keep
things in one global bucket) given we didn't want things like renames to
trigger reuploads and had to use checksums as a result.

~~~
sigil
But you could prevent reuploads with per-user de-duplication, while avoiding
the privacy issue of cross-user de-duplication.

I could see why this would be more work to implement (you have to key on
user+contenthash), but it would still be interesting to know how much Dropbox
and its users actually benefit from cross-user de-duplication.
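
The two scopes differ only in the lookup key. A minimal sketch, assuming SHA-256 as the content hash and an in-memory dict standing in for the storage backend (`BlobStore` and its methods are hypothetical names, not Dropbox's API):

```python
import hashlib


class BlobStore:
    """Toy content-addressed store; `per_user` picks the dedup scope."""

    def __init__(self, per_user):
        self.per_user = per_user
        self.blobs = {}          # dedup key -> file bytes
        self.bytes_uploaded = 0  # bytes that actually had to be transferred

    def _key(self, user, content_hash):
        # Per-user scope keys on (user, hash); cross-user scope keys on hash only.
        return (user, content_hash) if self.per_user else content_hash

    def put(self, user, data):
        """Return True if an upload was needed, False if it was deduplicated.

        A real client would send only the hash first; passing `data` here
        just keeps the sketch short.
        """
        key = self._key(user, hashlib.sha256(data).hexdigest())
        if key in self.blobs:
            return False
        self.blobs[key] = data
        self.bytes_uploaded += len(data)
        return True


cross_user = BlobStore(per_user=False)
scoped = BlobStore(per_user=True)
for store in (cross_user, scoped):
    store.put("alice", b"installer bytes")
    store.put("alice", b"installer bytes")  # same user again: skipped in both
    store.put("bob", b"installer bytes")    # other user: skipped only cross-user
```

In both modes Alice's second upload is skipped. Only the cross-user store lets Bob's upload finish without transferring data, which is the signal the article describes.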

~~~
shrikant
I have a hard time understanding this line of argument.

Per-user deduplication will mean, I need not upload the same file twice into
my own account? What's the use of this?

I keep some of my 'paid for' software installables backed up in my Dropbox,
and they tot up to ~1.5 GB (the Humble Indie Bundle games). When I started the
upload however, it took maybe 5 seconds because of cross-user deduplication,
and I am super grateful to them for this feature.

I imagine this feature saves users tons of bandwidth, as most of the people I
know use Dropbox for backing up important software, rare music and videos.

~~~
sigil
> I have a hard time understanding this line of argument.

It's not a line of argument. It's a line of inquiry. You've given anecdotal
evidence that cross-user deduplication benefits you and people you know, but
what about some actual numbers from Dropbox?

Producing actual numbers -- e.g. "cross-user deduplication saves our users 30%
of their upload time and bandwidth, on average" -- seems like a great way for
Dropbox to counter this issue.

> I imagine this feature saves users tons of bandwidth, as most of the people
> I know use Dropbox for backing up important software, rare music and videos.

We don't have to imagine! Let there be numbers!

Also -- "rare" music and videos that everyone's uploading duplicates of? ;)

------
pacemkr
I think most of us, if we were developing Dropbox, would have made the same
decisions. De-duping at the cost of complete privacy, even in the face of law,
is a sound technical and business decision for a service such as Dropbox.

When I think Dropbox, I think sharing, I think convenience, I don't think
backup and security. For backup I need more space, for security I need to use
my own private key (not a password that one can change/recover). Neither of
these things is offered by Dropbox. And this is the reason why I never
confused Dropbox with, say, CrashPlan. One is a way to share and collaborate,
the other is a place to send my private key encrypted bits to.

My individual privacy is not compromised by somebody being able to say if a
certain file is stored by the entirety of Dropbox user base. The other claim,
that, given a court order, Dropbox can be forced to turn over your files or
tell the court if you store a certain file _may_ be true, but I don't think
Dropbox, the company, has ever promised that level of security.

~~~
earl
It's far from obvious that some crusading DA can pick files and force dropbox
to tell him or her who has copies of that file.

~~~
JoachimSchipper
Dropbox can certainly get that information - remember, they can access all
accounts.

~~~
earl
sorry, that was my point. I use dropbox, and I knew they hashed files, but I
didn't think through the implications.

------
rafd
My take-away is this:

"What this means, is that from the comfort of their desks, law enforcement
agencies or copyright trolls can upload contraband files to Dropbox, watch the
amount of bandwidth consumed, and then obtain a court order if the amount of
data transferred is smaller than the size of the file."

This, I think is significant, especially if Dropbox is advertising security,
privacy and encryption. As the author mentions, the ToS are being updated to
reflect the above possibility ("if Dropbox receives a warrant, it has the
ability to remove its own encryption to provide data to law enforcement").

~~~
HerraBRE
On the face of it, the security of DropBox encryption seems comparable to that
of a padlocked room - where the key to the padlock is kept hidden in a safe
place.

Except the key is actually kept under the doormat, because people keep going
in and out and keeping it elsewhere is just too inconvenient.

If you told someone a room like that had military grade security, you would be
called a liar. No matter how fancy the padlock. Without knowing the details of
DropBox's setup I'm going to refrain from calling them liars, but this all
seems pretty fishy to me.

~~~
vladd
The analogy you mention is not very good because the hole discovered in their
security model is related to duplicate content identified via the same hash
value. If you have unique content that's different than anything else, even by
1 bit, then you're secure until someone uploads exactly the same content (this
is due to the way in which hash functions work); meanwhile, a key under the
doormat would imply a totally different security threat model.

~~~
HerraBRE
The downvotes I'm getting seem to indicate others agree I'm off the mark here.
Perhaps I was too inflammatory, perhaps people just don't understand my
analogy. Let me try again without the hyperbole. :-)

My point is, DropBox advertise proudly on their website that they use military
grade encryption to protect their users' data. However, it has now been
independently shown that the keys to this data are in DropBox's direct
possession and are in routine, daily use, decrypting one person's data so
another can access it (this is what happens when deduping allows you to
download something you never actually uploaded yourself).

To me, this implies that their claims of "military grade security" may be
unjustified and just yet another example of security theater in the cloud.

Without knowing the exact architecture of their system it's hard to say for
sure, of course. But think about what the encryption they claim to use is
probably supposed to accomplish. Then think about whether it actually does
that if a large proportion of DropBox's servers and employees have access to
the decryption keys.

[edit: DropBox store data on Amazon S3, so it is in fact important that they encrypt
it (even with relatively relaxed key management) as they have no direct
control over the infrastructure. I still don't think this meets the bar of
"military grade security", but I guess that's marketing for you.]

------
AlexandrB
Bottom line - you should assume that you cannot trust ANY cloud-based service
to keep what you upload safe from hackers (if they're determined enough), and
especially from governments. I'm not sure why anyone would be under the
illusion that this is the case. Even assuming that hosting such a service
would be legally possible today (not sure if it is, IANAL), it could be
illegal tomorrow and the service may be compelled to hand over any data.

If you want something to stay secret you MUST either:

a. Not put it on the internet/cloud.

b. Encrypt it yourself before uploading. Yes, there are trust issues with
commodity encryption software as well, but these may be mitigated somewhat.

~~~
daganh
This really isn't an issue for legitimate users backing up or syncing original
data. If you are paranoid about privacy, then you are probably doing something
wrong. If for some reason some other user produces the same file that I have,
then what's the big deal? They already arrived at that information themselves,
so there is still no compromise of data. Of course the government can seize
your data whether it is online or not. Don't sync copyrighted material. How
many different ways do they have to tell you it's illegal?

~~~
tomkarlo
"If you are paranoid about privacy, then you are probably doing something
wrong."

That's a pretty slippery slope to presume that a desire for privacy is
tantamount to an admission of guilt. Why would you presume that simply because
I have information I want to keep secret, I must be doing something wrong?

Conversely, is it ok if we install webcams throughout your house, since
apparently you have nothing you'd like to keep private for non-criminal
reasons?

~~~
daganh
In this instance we are talking about data that is already encrypted. Now tell
me why anyone would worry about their encrypted data being identifiable unless
it's not their data. It's not something I have to worry about, and I suspect
the majority of users don't need to worry about it either.

~~~
tomkarlo
For the same reason lots of folks may not want their Netflix queue or Amazon
purchase list put out in the public. Or their library lending list. All of
those are lists of legal, publicly-available items, but that doesn't mean
folks might not want and expect privacy when it comes to others knowing what
content they are consuming / collecting, regardless of its legality.

------
Terretta
It's not clear to me why Dropbox would need your keys to de-dupe. He says so
in the article, but doesn't say why.

Why not compute the file hash on your local machine before encryption, and
check that hash against a master dupe list (hash, dupe_count) of all hashes
from all users' pre-encrypted local files?

Secondly, I cannot see how this requires there to be an index of users' hashes.
Surely one could store hashes with reference count, increment when a user
adds, decrement when a user deletes. The user ID isn't necessary for a
reference counter.
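
A minimal sketch of such a dupe list, assuming SHA-256 over the pre-encryption bytes (`DupeList` is a hypothetical name; nothing here is confirmed to match Dropbox's design):

```python
import hashlib
from collections import Counter


class DupeList:
    """Master list of (hash, dupe_count) with no user IDs attached."""

    def __init__(self):
        self.refcount = Counter()

    def add(self, plaintext):
        """Register one more holder; report whether the content was already known."""
        digest = hashlib.sha256(plaintext).hexdigest()
        already_stored = self.refcount[digest] > 0
        self.refcount[digest] += 1
        return digest, already_stored

    def remove(self, digest):
        """Drop one holder; return True when the last reference is gone."""
        self.refcount[digest] -= 1
        if self.refcount[digest] <= 0:
            del self.refcount[digest]
            return True
        return False


dupes = DupeList()
digest, seen_before = dupes.add(b"holiday.jpg bytes")   # first holder
_, seen_again = dupes.add(b"holiday.jpg bytes")         # second holder
```

The counter alone says how many holders a hash has, not who they are. The sketch leaves open where each account's own file list would live.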

Not saying Dropbox isn't doing what he says. But he says de-duping proves they
can decrypt and proves they have a list of who has the same files. I don't see
it from de-dupe alone.

~~~
bgentry
The proof is indeed in the deduplication. If Dropbox can skip the upload
process of some large file because another user has already uploaded it, they
must also be able to decrypt that file in order to sync it with your other
machines.

Or in order for you to download it through the web interface unencrypted.

~~~
shasta
I suspect that dropbox works the way you think it does, but your argument
actually has a flaw. It could work like this:

- A hash is computed locally (on the clients with the large unencrypted file)
and sent along to be used by dropbox to detect dupes.

- The key used to encrypt the large file is some function of the file, but
not of the hash. The important point is that it's not encrypted with a client
specific key, but rather a file specific one. Thus if you have the file, you
can compute it.

- When a dupe is detected, the server requests that the uploading client send
it a copy of the key, encrypted with PGP so that only the other intended
clients can decrypt it.

I think that should work.
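
The first two steps can be sketched in a few lines. Everything here is an assumption for illustration: the prefixes, the function names, and especially `toy_cipher`, an XOR keystream that stands in for a real cipher and should not be used to protect anything. The PGP key-exchange step is left out.

```python
import hashlib


def dedup_id(plaintext):
    # Sent to the server so it can detect duplicates.
    return hashlib.sha256(b"dedup-id:" + plaintext).hexdigest()


def file_key(plaintext):
    # Also derived from the file, but under a different prefix, so knowing
    # dedup_id alone does not reveal it. Anyone holding the file can compute it.
    return hashlib.sha256(b"file-key:" + plaintext).digest()


def toy_cipher(key, data):
    # Toy stand-in for a real cipher: XOR with a SHAKE-256 keystream.
    # Applying it twice with the same key restores the input.
    keystream = hashlib.shake_256(key).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, keystream))


document = b"the same large file on two clients"

# Two clients holding the same file independently produce the same
# identifier and the same ciphertext, so the server can keep one copy.
alice_upload = (dedup_id(document), toy_cipher(file_key(document), document))
bob_upload = (dedup_id(document), toy_cipher(file_key(document), document))
```

Because the key depends only on the file, the ciphertext is deterministic, which is what allows de-duplication without a key held by the server.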

~~~
yoden
> - The key used to encrypt the large file is some function of the file, but
> not of the hash. The important point is that it's not encrypted with a
> client specific key, but rather a file specific one. Thus if you have the
> file, you can compute it.

That's actually an extremely interesting idea. I wonder if using some function
of the data, f, to determine encryption keys, leaks information about the
encrypted data? My armchair guess would be yes. Of course, the leaked amount
(well, the slightly non-random distribution caused by the key being a function
of the data) might be small for large files or a good function f, which would
mean it's probably okay?

I've never heard of any research of a cryptosystem that works like that
though, so I'd appreciate if anyone could provide some expert input (or maybe
we should just email bruce schneier)

~~~
shasta
I don't see why it would need to leak much more information than you're
already leaking with a hash. And I'm sure someone has investigated this idea.

------
inaequitas
Without knowing the internals of how Dropbox operates, my empirical
observations are that they employ block-level deduplication, i.e. when you
change bits in the middle of the file, the whole thing doesn't get
re-uploaded. Which means they keep pointers and have an algorithm that's
similar to LBFS (and Rabin fingerprints).

This means it's theoretically possible for parts of the file to come from
different sources, which means contraband files are 'built' from parts of
otherwise legal files.
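
A minimal sketch of content-defined block splitting in that spirit. A plain rolling sum stands in for a Rabin fingerprint, and `WINDOW`, `DIVISOR`, and the function names are assumptions chosen for the example, not LBFS's or Dropbox's parameters:

```python
import hashlib

WINDOW = 16    # number of trailing bytes that decide a boundary (assumed)
DIVISOR = 64   # aims for roughly one boundary per 64 bytes (assumed)


def split_blocks(data):
    """Cut wherever the sum of the last WINDOW bytes hits a fixed residue."""
    blocks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= WINDOW:
            rolling -= data[i - WINDOW]  # slide the window forward
        if i + 1 >= WINDOW and rolling % DIVISOR == DIVISOR - 1:
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])      # trailing partial block
    return blocks


def store_file(block_store, data):
    """Store each block under its hash; return the file's list of pointers."""
    pointers = []
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)  # shared if already present
        pointers.append(digest)
    return pointers


def load_file(block_store, pointers):
    """Rebuild a file from its pointers, whoever first uploaded each block."""
    return b"".join(block_store[p] for p in pointers)
```

Since a boundary depends only on the bytes just before it, inserting a byte near the start of a file shifts the later boundaries along with the data and leaves the later blocks unchanged, so only the first block or two would need uploading.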

------
varenc
With pure file encryption where the user's password serves as the key you
lose...password recovery features, public links for files, shared folders, web
access, mobile access (unless you want your phone doing the decryption)

All other syncing services do things a pretty similar way

~~~
yoden
This is wrong in several ways (as evidenced by other comments in the thread):

> public links for files

Easily done by creating an unencrypted copy.

> shared folders

You can do this by copying, encrypting with new keys for each shared user,
sending notification of those keys to the user (via some side channel), and
then deleting them on your end after they are accessed (you can re-encrypt
with the accessor's keys at this point). You can do a bit better if you leave
"half open" asymmetric channels (so you can store encrypted messages/keys only
the recipient can decrypt), but that might be overkill.

> web access

You will probably need to use Java (or NaCl) to do this, as JavaScript tends
to be too slow for asymmetric encryption (the needed bignum support just
isn't there). If you're willing to wait, it can be done in JavaScript in ~5
seconds on a desktop PC.

> mobile access

Uh, all modern smart phones can do encryption fast enough (well under a
second).

> password recovery features

This is the only salient point. You can kind of counter it by using the
security questions to encrypt the actual encryption key a second time, so that
the data can be decrypted if you answer the security questions. But that's
obviously less secure.

------
podperson
Rather than avoid deduplication (which is technically sound and benefits
everyone), perhaps the solution to this is to make it impossible for DropBox
itself to know who owns which files.

E.g. right now I assume a dropbox user owns a list of file ids with some
metadata (e.g. that user's name for those files). It follows that if the
government decides file XYZ is illegal then anyone with XYZ in their list is
in trouble.

The user account could keep track of the total size of all the user's files
and use arithmetic to keep it up-to-date, but not actually store the size of
individual files except when they are "looked at".

So then the user's password (say) which is not itself stored is used to unlock
stuff in the user's file table on a per request basis -- i.e. the actual file
ids are only computed as needed. The actual mechanism doesn't need to be
terribly secure, it just needs to be _deniable_. In other words without the
user's password we simply cannot unambiguously determine which files are his
or hers.

------
guan
Of course Dropbox knows the keys. If they didn’t, you wouldn’t be able to
access your files on so many platforms (web, desktop, iPhone) and you wouldn’t
be able to easily share folders with others.

Even though it isn’t spelled out, I’ve always suspected that many actual
backup services such as Backblaze don’t know the key if I decide to encrypt my
backups.

~~~
TomasSedovic
I don't think the "many clients" argument holds. If done properly, Dropbox
would (deterministically) generate the keys from your username and password on
the client every time you log in and encrypt/decrypt stuff there.

You're right that sharing folders between different accounts would require
having a key shared between the clients and thus stored on the Dropbox servers
as well.

However, they could generate the key for the shared folder, give it to you and
your buddies and yet store it encrypted using your master key generated from
your username and password. Then it would be accessible to you, but Dropbox
would not be able to decrypt it without getting your password.
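
A minimal sketch of that arrangement, under loud assumptions: the username doubles as the PBKDF2 salt, the iteration count is arbitrary, and the XOR "wrap" is a toy stand-in for a real key-wrapping cipher. All function names are hypothetical.

```python
import hashlib
import hmac
import os


def master_key(username, password):
    # Deterministic: the same credentials yield the same 32-byte key on
    # every client, so nothing needs to be stored server-side.
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode(), username.encode(), 100_000
    )


def _pad(master, folder_id):
    # 32-byte pad bound to both the master key and the folder.
    return hmac.new(master, b"wrap:" + folder_id.encode(), hashlib.sha256).digest()


def wrap_folder_key(master, folder_id, folder_key):
    # Toy wrap: XOR a 32-byte folder key with the pad. Applying the same
    # function again unwraps it.
    return bytes(a ^ b for a, b in zip(folder_key, _pad(master, folder_id)))


folder_key = os.urandom(32)                       # generated once per shared folder
alice = master_key("alice", "correct horse")      # derived on Alice's client
stored_on_server = wrap_folder_key(alice, "shared-photos", folder_key)
recovered = wrap_folder_key(alice, "shared-photos", stored_on_server)
```

The server would hold only `stored_on_server`. Recovering `folder_key` from it requires re-deriving the master key, which takes the password.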

It looks like they're not doing that, but hypothetically, they could. And the
examples you cite would not be impossible to deal with.

Of course, this would require much more work, would be tricky to get right and
they'd have Thomas Ptacek on their back for using JavaScript crypto in the
browser.

~~~
guan
You are absolutely right in terms of what is technically possible. My point
was more that it’s not really realistic to use crypto like that with a
Dropbox-like service.

------
kenjackson
Why doesn't DropBox just stop doing de-duplication? They must have the money
for the storage? The bandwidth savings for users isn't that big of a deal in
most cases. I expect that if I have a 2GB file that I'm uploading a 2GB file.
I don't cross my fingers that you already have big chunks of it.

This just seems like the type of thing that someone much smarter than I can
and will exploit in the future.

~~~
wazoox
Dropbox provides lots of storage for free. Therefore they have a strong
incentive to provide this free storage at the lowest possible cost, and
deduplication definitely makes sense for themselves if not the users.

------
bayes
This makes me worried about hash collisions as well. The article implies that
a file whose hash matches something they already have will never even reach
their servers - so presumably I just have to keep my fingers crossed that the
file they're synchronising to all my machines is the one I uploaded, and not
some other user's completely different file that happens to have the same
hash?

~~~
ceejayoz
Deduping with hash _and_ filesize in bytes should make a collision unlikely
enough.

~~~
gojomo
This is a common sentiment, but not really sensible. If you want to store (for
example) another 64 bits worth of information, you would always be better off
with 64 more bits of some strong hash than 64 bits of filesize.

~~~
slackerIII
I think the point is that creating a collision when data + fileSize is hashed
is much harder than just creating a collision when hashing data alone. Raymond
explains it well:
[http://blogs.msdn.com/b/oldnewthing/archive/2004/05/19/13493...](http://blogs.msdn.com/b/oldnewthing/archive/2004/05/19/134937.aspx)

~~~
justincormack
And that was written before the md5 collisions were discovered. And no
collision has yet been discovered for md5 for files of the same length, they
are all extension attacks...

~~~
gojomo
Absolutely false. Some MD5 collision-generators _specifically_ find pairs of
equally-lengthed inputs with the same hash. See for example hit #2 for [MD5
collisions]:

<http://www.mscs.dal.ca/~selinger/md5collision/>

'Extension attacks' are something else, which let you turn one collision into
more, or create valid hashes for combinations of unknown text plus a chosen
extension – not find an initial collision. See:

[http://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_con...](http://en.wikipedia.org/wiki/Merkle%E2%80%93Damg%C3%A5rd_construction#Security_characteristics)

The 'length extension' property can be helpful, once you find a collision
based on 'random' nonsense, in extending that into two documents that are each
meaningful-but-different and still colliding, as was done in this 2005 MD5
collision demonstration:

[http://replay.waybackmachine.org/20050612011328/http://www.c...](http://replay.waybackmachine.org/20050612011328/http://www.cits.rub.de/MD5Collisions/)

------
kmfrk
Early optimization is the root of all evil, so I understand that an
up-and-coming company might do this, but Dropbox has the traction and userbase
to make this a very relevant concern.

Popularity is also proportional to chance of being targeted by hackers and
approached by government or corporation representing intellectual property
owners.

------
diegob
If you knew a target file's hash, it might be possible to modify the dropbox
client to report that file as added, then dropbox would download that file
onto your computer. Of course, 10^77 possible hashes makes it unlikely.
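
The 10^77 figure matches the output space of a 256-bit hash such as SHA-256 (the hash size is an assumption here). A quick check of the arithmetic, with a hypothetical count of stored files:

```python
# Size of a 256-bit hash space: about 1.16 x 10^77.
hash_space = 2 ** 256
print(f"{hash_space:.2e}")

# Hypothetical number of distinct files held by the service.
stored_files = 10 ** 11

# Expected number of blind guesses before hitting any stored file.
expected_guesses = hash_space // stored_files
print(f"{expected_guesses:.2e}")
```

Blind guessing is hopeless at this scale. The scenario only works when the target hash is already known.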

------
rarrrrrr
Previous HN discussion of SpiderOak's (very different) approach to this same
topic: <http://news.ycombinator.com/item?id=1640074>

~~~
wazoox
SpiderOak definitely looks better from an end user's point of view. How is it
going? All of the PR love seems to be going to DropBox nowadays.

------
ikcor
Dropbox has responded to this: <http://forums.dropbox.com/topic.php?id=36365>

------
ToastOpt
Could be worse. Last year when I tried ZumoDrive (a similar service), I
noticed it marks the web-browser login cookies as safe for HTTP, and defaults
to open session pages via HTTP. All it takes is checking your ZumoDrive once
from an unsecured WiFi and your account may be compromised.

At least Dropbox gets the endpoint-to-server encryption right.

------
qeorge
It's not obvious to me this is a price based decision for Dropbox (although the
benefit there is obvious).

Arguably the best feature of Dropbox for me is binary diffs. If you encrypt
the Dropbox this goes out the window, or at least becomes significantly harder
to pull off. Am I wrong?

------
dpcan
Uhm "(if it didn't, it wouldn't be able to detect duplicate data across
different accounts"

How about comparing a hash of the encrypted data?

~~~
orangecat
Encryption keys would be different for different users.

~~~
AjithAntony
I don't think anyone suggested the keys were unique to users.

------
kapitalx
tl;dr Dropbox use their own encryption keys to encrypt your data rather than
encrypting each user's data using a user provided key. This helps them dedup
files and save space/money. This implies that a court could ask to analyse
your data. Dropbox will update their privacy policy to say this clearly.

------
cageface
I've been traveling for most of the last six months, mostly dependent on slow
& unreliable hotel wifi. Dropbox's implementation has saved me a ton of time
backing up files that would have taken forever to upload in their entirety.

------
dude_abides
Here's a startup idea: A background service that runs on a PC/iPad/Phone,
checks for new media files (pictures/mp3s/videos) and automatically re-encodes
them, such that the quality, etc. is preserved but the file hash changes and
cloud services can no longer deduplicate it.

It will be quite valuable for users of services like Dropbox, Amazon Cloud
Player, etc.

~~~
bigiain
Here's another idea: a service that identifies hash values for
popular-but-copyright-encumbered files, which you can feed into your (patched
copy of) Dropbox client and pretend you're about to upload it. Bam, it appears
in your Dropbox account!

And sneakier still, a remote hosted bit torrent client that automates that
process for you in a way that can provably show you never downloaded the
copyrighted file, you just had a number on your drive, which if interpreted as
a SHA256 hash key happened to identify a duplicate of the copyright file on
Dropbox... (I'd be spectacularly impressed if someone managed to make _that_
trick fly in court in a precedent-setting manner!)

~~~
TillE
That's wonderfully devious. And probably quite straightforward to implement.

I hope it doesn't happen, because that'd be a bit of a nightmare for Dropbox.
They could blacklist known pirate rips of movies and the like, but what do you
do when you receive a takedown notice for DRM-free content that some users are
legitimately storing, but Dropbox is now (accidentally) illegally
distributing? Could be the end of global de-duping.

~~~
shasta
If this really became a problem for Dropbox, one response would be to have the
server issue random challenges to clients that purport to have the file.
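
A minimal sketch of such a challenge, assuming the server can read the stored plaintext and using HMAC-SHA256 keyed by a fresh random nonce (`Server`, `prove`, and the message flow are hypothetical, not a description of Dropbox's protocol):

```python
import hashlib
import hmac
import os


def prove(nonce, file_bytes):
    # Computing this requires the whole file, not just its hash.
    return hmac.new(nonce, file_bytes, hashlib.sha256).digest()


class Server:
    def __init__(self):
        self.files = {}  # content hash -> stored bytes

    def store(self, data):
        digest = hashlib.sha256(data).hexdigest()
        self.files[digest] = data
        return digest

    def challenge(self):
        # Fresh per request, so an old response cannot be replayed.
        return os.urandom(32)

    def verify(self, content_hash, nonce, response):
        stored = self.files.get(content_hash)
        if stored is None:
            return False
        return hmac.compare_digest(response, prove(nonce, stored))


server = Server()
known_hash = server.store(b"some large file contents")

nonce = server.challenge()
honest = server.verify(known_hash, nonce, prove(nonce, b"some large file contents"))
# A patched client that only knows the hash has nothing valid to send.
cheater = server.verify(known_hash, nonce, prove(nonce, known_hash.encode()))
```

Here `honest` is accepted and `cheater` is rejected. The cost is that the server has to hash the stored file once per challenge.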

