The first part of the article is of unusually low quality for Ars Technica. It seems like the author is trying to bash Mega for being reportedly lazy in their implementation.
This really is a shame because the parts regarding entropy, deduplication, and the conclusion, are very relevant!
> Files and folders, therefore, are encrypted with symmetric encryption. Symmetric encryption means the same key is used to encrypt and decrypt your data; this is less secure than asymmetric encryption (where one key encrypts and a different key decrypts), but it's faster and easier to implement.
Asymmetric encryption is not appropriate for encryption of large amounts of data (or would require incredibly large keys to work). Symmetric encryption is always used to encrypt large chunks of data, and asymmetric encryption can be used to encrypt a symmetric key.
The article seems to hint that Mega was lazy in its implementation, that's not really true.
> For the data stored in Mega, the encryption key used is generated for you at the time of sign-up and is itself hashed—or scrambled—using your account's password.
The key can't be hashed, or it couldn't be recovered. A hash is meant to be un-reversible. Unless the article means that the key itself is a derivation (a hash?) of the password. This would indeed be a major issue!
If the key is encrypted using the password (hopefully after running it trough a PBKDF), it is fine.
The absence of a password change option is indeed worrying, but the absence of a password recovery option is reassuring. If you lose your password, then you lose the ability to decrypt your key. If you lose your ability to decrypt your key, you lose your data.
It indeed is problematic that you can't back up the key though!
I said this on the last thread about Mega. I'm fairly sure that the encryption is not to protect your data from them and others, its to protect them from your data by giving them an optionality to deny any knowledge of what they are hosting.
It is in there interest to do de-duplication and as they are still required to remove files under the DMCA it will just mean that multiple people loose there infringing files at once. All someone then needs to do to upload the file again is change one byte.
The encryption also stops them from being able to implement a back door for the content industry to police the site themselves as was the case with megaupload.
If each file is getting it's own key, that is, if THEHOBBIT.mp4 properly generates a unique key each time it's uploaded by each user, would there even be much overlap in Alice's and Bob's file data as hosted on Mega?
If the file blocks are being encrypted with a unique key before they're uploaded, you seemingly wouldn't have to alter any bytes of a file that got hit with a DMCA request. You'd just have to re-upload it.
If every copy of THEHOBBIT.mp4 resulted in the same data after it was encrypted, if it was a predictable transformation, I don't see how any court would give Mega a pass for not 'knowing' what the contents of an uploaded file were. They'd not only know they'd be explicitly generating a fingerprint of that unique file in creating the encrypted version. It'd be irrelevant as a legal shield.
Block-level de-duplication doesn't require or imply knowledge of, or access to, decrypted file contents. It would be trivial to use de-duplication in a secure way. (though you're not going to reap nearly as much return from de-duplication of secure files, natch)
I'd agree that they're not serious, if it's true that they're deterministically generating encryption keys based on file hashes. But the de-duplication part has nothing to do with it.
When the signature is generated, they have no idea how close one file is to another file. So long as various copies of THEHOBBIT.mp4 continue to be uploaded, modified by just one byte, user will be able to evade detection, even if Mega keeps a "known bad" list of signatures.
The practice of having to create and share one-byte-different copies of files would be enough to kill this service for the users that Dotcom professes to be enabling, in thumbing his nose at the authorities.
The biggest user-facing problem with torrents today is that it takes too long to find 'the right' or 'good' files. Ones that aren't loaded up with malware, or mislabeled, or encoded poorly, etc. Having to push various one-byte-difference copies of files is going to explode the search problem for users. And, once found, 'good' files couldn't be popularized, lest they be pulled down. And once one user scored a 'good' file, they wouldn't be able to share it directly through the service to even known associates due that same universal blacklist.
The usability on such a system sounds like a wet-dream for driving people into the arms of for-pay alternatives.
Further, having a fingerprint on every file you upload is a direct violation of the marketing pitch of security and anonymity that Dotcom's making to the 'legit' crowd.
Consider if there were a security breach at a popular online service and a text file of compromised accounts released. The FBI could upload a copy and then subpoena Mega for information on every user who had that same copy.
Sure, people could alter their files themselves, but who wants to put up with that kind of cognitive overhead to prevent malicious prosecution/persecution?
Was about to state the same. The encryption is Dotcom's shield against future lawsuits toward himself or Mega. The simple fact that it is technically 'impossible' to know what users store on their servers protect them from the megaupload situation happening once again.
As read elsewhere, do not store any confidential file on mega: the encryption does not protect the user, but the platform itself.
I think that infringement protection can be handled using the following scheme:
User uploads file. Mega computes the convergent encryption E(F) using the hash of the file H(F). The hash of E(F) is H(E(F)) and determines the file already exists in Mega, and thus is deduplicated.
Mega does not tell the user this, and their used storage size increases (thus, RIAA cannot upload The Hobbit and determine it's already there). The user enters their password P, the convergent hash H(E(F)) is encrypted with the user's password - P(H(E(F))) - and is only stored correlated with the user as such. The hash of the original H(F) (used to convergently encrypt the file) is also encrypted with the user's password, as P(H(F)).
On retrieval, the user enters their password P, the hash P(H(E(F))) and the hash of the original P(H(F)) is decrypted. Now Mega knows where to find the convergently encrypted file, using H(E(F)) to locate E(F). Mega decrypts using the hash of the original, and returns the file F to the user.
If the password P is only stored hashed (as it should!), then there is no way to correlate a given infringing file with any other user's ownership. The user's account only contains P(H(E(F))) and P(H(F)), both of which are unique to that user.
Anyone see a problem (other than the implementation details and possible lack of motivation for Mega to do so)?
Symmetric encryption means the same key is used to encrypt and decrypt your data; this is less secure than asymmetric encryption (where one key encrypts and a different key decrypts), but it's faster and easier to implement.
Isn't that comparing apples to oranges? What would be the benefit for Mega or the user to switch to RSA for that? Encryption would be using a symmetric cipher anyway, unless I'm missing something.
You're not missing anything. I clenched my teeth when I red this phrase. It doesn't mean anything and discredits the whole article as it shows the author has got a superficial knowledge in cryptography.
Isn't in-browser crypto clown crypto by definition though?
After all, unless you audit all the code fetched each time you load the page, they can mess with the code client side at any moment without anybody noticing. Is this not more telling about the limits of webdev rather than the skills of Mega's coders?
The trust model is inherently broken when doing crypto in the browser the way they are, since the code could be changed at any time. But that doesn't mean you shouldn't implement things properly outside of that.
Every time you visit the website, the crypto code is brand new to you. A site that securely and safely encrypted your data yesterday might be sending keys to the server tomorrow. This is a problem that's fairly unique to doing crypto on the client side in the browser.
> Symmetric encryption means the same key is used to encrypt and decrypt your data; this less computationally-intensive and easier to implement than asymmetric encryption, which we'll get to in a moment.
"Files and folders, therefore, are encrypted with symmetric encryption. Symmetric encryption means the same key is used to encrypt and decrypt your data; this is less secure than asymmetric encryption (where one key encrypts and a different key decrypts), but it's faster and easier to implement."
I just saw that and facepalmed as well. Beyond the sheer number of attacks you can perpetrate on RSA versus AES (never mind in the browser), MEGA is using 2048-bit RSA, which has a 112-bit security level, versus 128-bit AES.
Symmetric ciphers are used almost exclusively for the encryption of large files. They're more secure, faster, and easier to work with.
> [...] the implication is that a uniquely identifiable thing can be derived from any given piece of data. This returns some burden to Mega—rather than throw up its hands and say that it has no idea what Mega users Alice or Bob have in their Mega accounts, there apparently is a way of telling whether or not Bob and Alice have the same file or files. If the MPAA gets wind that Bob is hosting a copy of The Hobbit: An Unexpectedly Long Movie in his Mega folder, and Alice also happens to have the same file in her Mega folder, it's trivial to prove that Alice has the same file—in fact, the nature of deduplication means there's some record of every deduplicated block, and therefore every other infringing user.
This actually doesn't follow at all. What files a given user has in their account doesn't have to be known to the server, though it's generally not difficult to correlate gets/puts/deletes to accounts. If the hierarchy is handled on the client side and encrypted like everything else, then all the server knows is: we store blobs identified by keys, and when those keys are overlapping, we don't need to store a second copy.
This does make keeping track of storage usage impossible.
It sounds to me like Mega is just using block-level hardware de-duplication and they're making sure it's clearly spelled out in the Terms.
After all, if each file is being encrypted with different keys, then Alice and Bob's encrypted copies of THE HOBBIT wouldn't match at all. So much as removing the file from Mega's servers and re-uploading it would seemingly change the key and thus the encrypted data entirely, right?
And when encrypted blocks do happen to match, there'd be absolutely no way of knowing whether they represent the same block of the same source file. In fact, one could be reasonably confident that a block overlap was purely incidental, given the random unique keys. (barring completely broken key generation)
> After all, if each file is being encrypted with different keys, then Alice and Bob's encrypted copies of THE HOBBIT wouldn't match at all. So much as removing the file from Mega's servers and re-uploading it would seemingly change the key and thus the encrypted data entirely, right?
As far as I can tell, they're generating the keys for the files from a hash of the file, meaning the keys are not random and unique (for the files -- user keys are different). I described rough steps for secure dedupe in another comment below.
Then the encryption isn't worth the processing time. And as a 'legal shield' it seems like it will not only be utterly ineffective in court, but it's going to be cited as proof of Mega's ability to know what people are hosting, due their generation of a file fingerprint as a matter of course.
They're going to wind up having to maintain a hash blacklist which is going to be just annoying enough that people aren't going to bother.
It's a shame mega isn't just running an encrypted block-level store. They should have shipped client binaries that handle the encryption and key generation and simply exchanged blocks with Mega servers.
No, it's the other way around; a block overlap is effectively a guarantee it's the same file with the same key.
Suppose they are using 1MB chunks. The keys are 16 bytes. The 1MB chunk may be 2 ^ (8 millionish) different things, the keys only 2^128 different things. Thus, two different 1MB chunks, to be the same file with different keys, must have two of their possible 2^128 encryptions overlap out of the possibility space of 2^(8 millionish); that any two different chunks would even have such an overlap available is exceedingly, exceedingly improbable, let alone that the two users would end up with the exact correct keys to have actually manifested the overlap.
There are only two plausible theories when an overlap occurs: 1. It is the same file with the same key or 2. Someone's worked out how to artificially create a collision. And those two are hardly equal in probability either....
: Yeah, that's a simplification since there aren't actually 2^128 valid keys. Roll with it. Actually I've taken a number of small liberties for simplicity; none of them matter.
Why would you choose so large a block size for de-duplication? Your disk space savings decrease as block size increases. Choosing a more effective block size will increase the chance of collisions. And collisions are rather the desired goal of de-duplication.
And all of your other costs associated with the block go up. While I do not work on the backup product my company produces, I've used it as a service. They've told me 1MB is pretty much too small nowadays, though it made sense when they started.
People have this weird fetish around deduplication, but it isn't magic. It makes it so the tenth copy of storing a backup of Windows XP doesn't hardly cost you any space. This is where the astonishing compression claims come from. It does not magically compress much of anything else, though. The claims are true, but not generally applicable. In practice, the middle bits of otherwise unrelated files don't get de-duped, excepting a couple of obvious and rare cases like 'a megabyte of zero padding in the middle of a file' which hardly amount to anything. Your World of Warcraft texture file is simply not going to overlap with your eBook copy of 50 Shades of Grey, and in general, two things sampled even from a 2^(524,288) possibility space, which corresponds to an absurdly small 64KB block size, are very unlikely to collide.
Block size has very little effect on collisions; what collides are identical files, or files that are nearly identical because they are versions of the same thing (and block size tends to matter surprisingly little there for various reasons), and what doesn't collide is everything else.
On a related note, if any of you are designing a file format, please place metadata at the end, so multiple copies of the same file with different metadata can be easily deduplicated. If the metadata is variable length and at the beginning, the block boundaries get shifted. In order to deal with shifted files, one ends up using a pseudorandom function of block contents to decide where the dedup block boundaries are (rather than some multiple of the filesystem block) in order to be able to re-sync the dedup block boundaries between the two files beginning with different length metadata.
For that matter, if you're writing a library to handle an existing file format that has a flexible location for metadata, please put any metadata (especially user-generated metadata) as far back in the file as possible.
Of course, the main advantage is that you don't have to shift all of the data around if the user starts editing file metadata, but it does also help block-level deduplication.
I'll readily concede that de-duplication isn't going to get you much, once you start talking about uniquely-encrypted data.
1MB just sounded much larger than what I've read and seen. At that size, yeah, about the only thing you're going to be de-duplicating is entire files. Interesting that smaller block sizes don't de-duplicate enough in practice to justify it.
But I'm guessing they wanted to do it that way so that the key generation is done on the client side. They could generate high-quality keys on their servers through the normal means of doing so, but then they have access to all user's private keys, and part of the point of this service is that the users don't want to trust the service (ie, Mega).
A rather good write-up, though I'm not convinced the author's points justify the "Megabad" in the title. The decisions made are not always the most secure, but in general appear to be reasonable given the nature of the service - i.e. that this is material meant to be shared not just backed up.
Indeed, by just reading the title I thought the encryption is already broken and useless. It's just a filesharing site and not a secure and personal backup space. (I'm sure some guys will use it as their only backup like it was happening with MU though).
I'm not sure that would work either. If Alice encrypts a block with key A and uploads it, then Bob encrypts the same block with key B and uploads it and Mega only stores Alice's copy then Bob won't be able to decrypt his own data.
(Edit: I'm talking about the proposal from previous threads of encrypting with a random key, not encrypting with the hash.)
That's not how it works, though. Rough steps for how a system like this can work (bearing in mind that Mega might be doing it entirely differently):
1) Alice takes file P and hashes it to produce key K
2) Alice encrypts file P with key K to produce file C
3) Alice encrypts key K with key U (her user key) to produce key X
4) Alice sends key X and file C to Mega for storage
When Bob does the same steps on a matching file, the only things that differ are key U and key X -- these are his user key, and his user key for the data. That means that Mega can deduplicate file C freely, because they're identical.
Note that Mega never knows the hash of the original file -- key K -- or it'd be able to decrypt the files.
"Each file and each folder node uses its own randomly generated 128 bit key. File nodes use the same key for the attribute block and the file data, plus a 64 bit random counter start value and a 64 bit meta MAC to verify the file's integrity."
This implies that it is not convergent, unless dedup is somehow done with the "Meta MAC".
The trick lies in ensuring the key used for encryption is linked to the file, not the user. You then have to store a separate key for each encrypted file (and that encryption key is probably going to need to be wrapped by your actual user key for storage).
Chrome has native crypto.getRandomValues() and Safari's Math.random() was changed to use Arc4 PRNG back in 2008. I'm not sure if it has been updated to account for the flaws in the first bytes of Arc4, but if it has then this is cryptographically strong.
I don't know where Firefox and the other browsers stand on CSPRNGs, but as more and more implement crypto.getRandomValues(), this will improve services relying on client-side crypto.
It's not true. And yes, for equivalent bits of security, you need much larger keys with the RSA algorithm. The main reason it's not used for much beyond key-exchange is that it's thus exponentially slower.