ZeroDB white paper: An end-to-end encrypted database (arxiv.org)
171 points by mwilkison on Feb 24, 2016 | hide | past | favorite | 38 comments

Interesting concept. I disagree with this statement from the "Data confidentiality" section:

> Unlike CipherBase [6], our approach doesn’t leak information about the order of the elements. By observing access patterns, the server can only figure out which buckets are leaf nodes and which are branch nodes, but not whether the child node is located to the left or right of the parent node.

The paper correctly notes that when a client does a range query, it leaks information about the total number of blocks returned. However, if I've understood correctly, it also leaks the identities of those blocks. The closer two tree nodes are in order, the more likely they are to be touched by the same request. A passive adversary could use this to extract information about the relative ordering of encrypted nodes.
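A toy simulation (structure and numbers entirely hypothetical) shows how co-access frequencies over opaque bucket IDs would reveal ordering to a passive observer:

```python
import random
from collections import Counter

# Hypothetical model: 20 leaf buckets holding a sorted index, visible to
# the server only as opaque IDs. Each range query touches a contiguous
# run of leaves, so nearby leaves are requested together more often.
random.seed(0)
num_leaves = 20
co_access = Counter()

for _ in range(2000):
    start = random.randrange(num_leaves)
    width = random.randint(1, 4)            # small range queries
    touched = range(start, min(start + width, num_leaves))
    for a in touched:
        for b in touched:
            if a < b:
                co_access[(a, b)] += 1

# Adjacent leaves co-occur far more often than distant ones, so the
# co-access counts alone reconstruct the hidden ordering.
adjacent = co_access[(5, 6)]
distant = co_access[(5, 15)]
print(adjacent, distant)  # adjacent >> distant
```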

Also, active adversaries cause much worse problems. Let's say I have the ability to cause someone to do queries on the encrypted database, without observing the results. By simply doing separate index lookups on two different fields, I can extract a lot of information by looking at which nodes were touched by both queries.
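A minimal sketch of that intersection attack, assuming the attacker sees only opaque node IDs and no plaintext (all IDs here are made up):

```python
# Hypothetical trace: the sets of encrypted node IDs the server served
# while resolving two index lookups the attacker triggered.
nodes_for_city_query = {"root", "b12", "b47", "leaf301", "leaf302"}
nodes_for_age_query = {"root", "b12", "b58", "leaf302", "leaf419"}

# Nodes touched by both queries (beyond the shared upper levels of the
# tree) suggest records satisfying both predicates, without the attacker
# ever decrypting a single byte.
shared = nodes_for_city_query & nodes_for_age_query
print(sorted(shared))  # ['b12', 'leaf302', 'root']
```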

Finally, I noticed that the paper doesn't describe how the B-trees are modified. If the client is the only party that knows how to decrypt nodes, then it's responsible for splitting and merging nodes to keep the tree balanced. Doesn't this require some kind of distributed coordination, so that multiple clients don't trample on each other's changes?

Another thing from that same section that worried me was their discussion of inference attacks on first and last names. Their results assume names will be queried based on a particular distribution instead of an arbitrary (maybe even adversarially-chosen) one. This is a very, very weak model of security that categorically eliminates realistic attacks.

The thing is that anybody who is able to initiate queries already has access to that data, so there's no need for inference attacks there :-) An adversary cannot query data based on a particular distribution because it cannot initiate meaningful queries.

Whether an active server-side adversary can trick the client into doing certain queries is another question, though.

You're not understanding my problem. You modeled security against a certain attack by picking one particular distribution of queries and saying "we resist this attack this much with this query distribution". An attacker doesn't need to choose the query distribution for your result to not apply, it just needs to be any other possible distribution than the one you chose.

You are right about the identifiers. And that is exactly what makes us concerned about access-pattern attacks.

They are not as dangerous when queries are rare. But we are exploring ways to mitigate this attack vector completely using oblivious RAMs, which are becoming more and more performant.

You are also right about clients updating the B-trees. Currently we use object-level locks to keep data consistent. We may add storing deltas for buckets server-side when a conflict appears, plus additively homomorphic encryption for counters (the BTree.Length object).
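For illustration, version-checked, object-level conflict detection along these lines could look roughly like this (the API is entirely hypothetical, not ZeroDB's implementation):

```python
import threading

# Sketch: each encrypted bucket carries a version number; a writer's
# commit is rejected if another client committed to that bucket first,
# forcing a re-read and retry on the client side.

class BucketStore:
    def __init__(self):
        self._lock = threading.Lock()
        self._buckets = {}      # bucket_id -> (version, encrypted_bytes)

    def read(self, bucket_id):
        return self._buckets.get(bucket_id, (0, b""))

    def commit(self, bucket_id, expected_version, new_bytes):
        """Atomically store new_bytes only if no concurrent write landed."""
        with self._lock:
            version, _ = self._buckets.get(bucket_id, (0, b""))
            if version != expected_version:
                return False    # conflict: client must re-read and retry
            self._buckets[bucket_id] = (version + 1, new_bytes)
            return True

store = BucketStore()
v, _ = store.read("bucket7")
assert store.commit("bucket7", v, b"ciphertext-1")      # first writer wins
assert not store.commit("bucket7", v, b"ciphertext-2")  # stale version fails
```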

Also you cannot do queries on the encrypted db without being able to observe the results. E.g. if you are able to initiate a query on a fraction of the data, you already have access to that fraction of the data. And you cannot initiate a meaningful query if you are a server-side observer.

Of course, active adversaries could return wrong results (wrong pieces of the index); we haven't yet fully considered which problems that could cause.

> And you cannot initiate a meaningful query if you are a server-side observer.

But that's only true if you consider the database in isolation. It's a very bad assumption to make when using it as a component of a larger system!

Let's say you build a social networking service which uses ZeroDB to store its data; I'm an attacker who has access to the database server. There are all kinds of ways I can gain indirect control over the requests that users make to the service. What if I convince two different users to open a web page containing something like "<img src='https://supersecuresocialnetwork.com/friend-list.jsp'>"? I've just triggered an operation that will cause the app servers to request tree nodes corresponding to those users' accounts, and those of their friends, at predictable times. I can correlate the requests and determine with some level of confidence whether the users are mutual friends, all without being able to see a single byte of plaintext. If the service is publicly available and I can register my own accounts, that opens up another huge category of attacks.
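The attack sketched above can be illustrated with a toy trace (all node IDs, timestamps, and the log format are hypothetical):

```python
# Toy sketch: the attacker knows roughly when each victim's friend-list
# request was triggered and sees which encrypted tree nodes the server
# handed out in each time window -- never any plaintext.

# Hypothetical server-side log: (timestamp, node_id) pairs.
access_log = [
    (10.0, "nodeA"), (10.1, "node_alice"), (10.2, "node_shared_friend"),
    (42.0, "nodeA"), (42.1, "node_bob"), (42.2, "node_shared_friend"),
]

def nodes_in_window(log, start, width=1.0):
    return {n for (t, n) in log if start <= t < start + width}

alice_nodes = nodes_in_window(access_log, 10.0)  # Alice's request window
bob_nodes = nodes_in_window(access_log, 42.0)    # Bob's request window

# Overlap beyond the common upper tree levels hints at shared friends.
overlap = alice_nodes & bob_nodes
print(sorted(overlap))  # ['nodeA', 'node_shared_friend']
```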

My point is that you're making an awful lot of assumptions about what strategy an attacker will use. If you don't have provable security against an entire class of attacks (e.g. cryptographic properties like IND-CPA) then the burden is on you to be more imaginative than whoever is trying to compromise your system.

Is there a reason that you did not cite Enigma in your list of notable works? They solve the adversary problem by splitting tasks among a network of nodes that post their intermediate work to a blockchain where it can be verified for accuracy:

Very recently, Baum et al. developed a publicly auditable secure MPC system that ensures correctness, even when all computing nodes are covertly malicious, or all but a single node are actively malicious [18]. Their state-of-the-art results are based on a variation of SPDZ (pronounced speedz) [19] and depend on a public append-only bulletin board, which stores the trail of each computation. This allows any auditing party to check the output is correct by comparing it to the public ledger’s trail of proofs. Our system uses the blockchain as the bulletin board, thus our overall security is reduced to that of the hosting blockchain.

I'm not a db or crypto expert, but this seems like a promising solution to the active adversary problem if you have an honest, healthy blockchain that is incentivized correctly.

Yeah, actually that's true. Enigma is worth mentioning as one of the MPC references.

It looks like Enigma would face challenges in many practical environments (mainly because it involves a lot of small interactions between nodes, so it should be affected by latency). So it's more like a way to build an "Ethereum of private computations".

But, as you pointed out, it may offer good clues for solving the problem of active adversaries!

Worth noting one of the previous posts, https://news.ycombinator.com/item?id=9262846

I raised some questions there related to your last concerns. I'm really interested to go through the current state and see how things have evolved to mitigate them.

This looks very interesting.

I guess this would be the next-best thing after FHE?

The catch is the performance, right? I mean, the paper says:

"In order to make the database end-to-end encrypted yet still capable of performing queries, the client encrypts the buckets (at the time of creation or modification). The server, which stores the buckets, never knows the encryption key used. The objects referenced by the leaf nodes of the B-Tree indexes are also encrypted client-side. As a result, the server doesn’t know how individual objects are organized within the B-Tree or whether they belong to an index at all. Since ZeroDB is encryption agnostic, probabilistic encryption can ensure that the server cannot even compare objects for equality. When a client performs a query, it asks the server to return buckets of the tree as it traverses the index remotely and incrementally, as in Fig. 2b. The client fetches and decrypts buckets in order to figure out which buckets in the next level of the tree to fetch next. Frequently accessed buckets can be cached client-side so that subsequent queries do not make unnecessary network calls."
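That traversal can be sketched roughly as follows; the XOR "encryption", bucket layout, and server interface here are toy stand-ins, not ZeroDB's actual protocol (XOR would of course not be secure):

```python
import json

KEY = 0x5A  # toy key; the point is only that it never leaves the client

def encrypt(obj):
    return bytes(b ^ KEY for b in json.dumps(obj).encode())

def decrypt(blob):
    return json.loads(bytes(b ^ KEY for b in blob))

# Fake server: opaque bucket_id -> ciphertext. Two levels: root + leaves.
server = {
    "root": encrypt({"is_leaf": False,
                     "children": [["m", "leafA"], ["~", "leafB"]]}),
    "leafA": encrypt({"is_leaf": True, "entries": [["apple", 1], ["cat", 2]]}),
    "leafB": encrypt({"is_leaf": True, "entries": [["pear", 3], ["zoo", 4]]}),
}

cache = {}  # client-side bucket cache avoids repeat round-trips

def get_bucket(bucket_id):
    if bucket_id not in cache:
        cache[bucket_id] = decrypt(server[bucket_id])  # one round-trip
    return cache[bucket_id]

def search(term):
    bucket = get_bucket("root")
    while not bucket["is_leaf"]:
        # The client alone decides which child to fetch next; the server
        # only sees an opaque bucket ID being requested.
        child_id = next(cid for hi, cid in bucket["children"] if term <= hi)
        bucket = get_bucket(child_id)
    return dict(bucket["entries"]).get(term)

print(search("pear"))  # 2 round-trips on a cold cache; prints 3
```

Each tree level costs one round-trip on a cold cache, which is where the latency concern below comes from.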

Which would suggest that it might be a bit slow to do many round-trips for each query.

I'm guessing there could be some specific use-cases where this is not so relevant?

I would love to have some more knowledgeable people comment on this as it would be really neat to be able to start using a DB with these features.

> When a client performs a query, it asks the server to return buckets of the tree as it traverses the index remotely and incrementally, as in Fig. 2b. The client fetches and decrypts buckets in order to figure out which buckets in the next level of the tree to fetch next. Frequently accessed buckets can be cached client-side so that subsequent queries do not make unnecessary network calls.

>Which would suggest that it might be a bit slow to do many round-trips for each query.

I was curious and found some information through a search (key point italicized) [1]: "On the performance aspect, with a real world use case of 1GB index, just 150KB of data must be transferred on average over three requests to fetch the results back. In full text search terms, 250MB of data can be queried in around 500msec which even though slow, may not be prohibitively slow for some use cases. Insert queries also may take around the same time. The number of requests needed to fetch the query results grows logarithmically with the data size."
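The logarithmic claim is easy to sanity-check with assumed parameters (the bucket and entry sizes here are guesses consistent with the quoted numbers, not ZeroDB's measured figures):

```python
import math

# Back-of-envelope: 150 KB over three requests suggests ~50 KB buckets.
# With an assumed ~100-byte indexed entry, a 1 GB index fits in a
# depth-3 tree of high fanout, matching the "three requests" figure.

bucket_bytes = 50 * 1024               # ~150 KB / 3 requests
entry_bytes = 100                      # assumed average entry size
fanout = bucket_bytes // entry_bytes   # ~512 children per bucket

index_bytes = 1024 ** 3                # 1 GB index
entries = index_bytes // entry_bytes

depth = math.ceil(math.log(entries, fanout))
print(depth)  # 3 levels => ~3 round-trips, growing with log(data size)
```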

[1]: http://www.infoq.com/news/2015/04/Encrypt-Database-CryptDB-Z...

>I'm guessing there could be some specific use-cases where this is not so relevant?

They claim the performance degradation is not much, but that would get worse as the number of clients grows. The clients would also need a relatively decent amount of RAM and CPU power to handle this.

Related: CryptDB from MIT - http://css.csail.mit.edu/cryptdb/

Yep, relevant pieces. Thanks for bringing this up!

Our performance has become a little better since then. We'll also update our full-text search algorithm to use Lucene's practical scoring function, which makes it more scalable and will also help with performance.

As the number of clients grows, things actually get better for read queries (as long as we support MVCC). However, multiple writing clients can create congestion if they write data to the same place, which can indeed affect performance.

Do you have a specific use case in mind for this?

I ask because apparently there would be a practical limit on the size of the DB.

So considering that you would probably need to restrict the DB use to something specific, what would that be? (at least while performance gets better/practical for very big datasets)

I don't think it's so much a size issue. I'd use it for applications where you don't have too many simultaneous users of the same dataset (in the future, that constraint would apply only to simultaneous writing users).

When users have their own private datasets, though, it's fine to scale up the number of users.

This limitation comes from a) invalidation requests and b) a limited ability to do server-side conflict resolution.

Nice find. Thanks!

It seems that this initial release only supports Python. The client is obviously heavy, since it has to run all the queries, etc. Thankfully the protocol is well defined, so supporting more languages is just a question of effort.


They are our friends, and I keep asking them to define an open standard instead of implementing everything in Python on a Python database, and to make it work with psql or Cassandra. I hope they will release something for this, as we wish to include ZeroDB in our encrypted messenger (actor.im).

In fact, the approach you used for actor (using GWT) is no less applicable to us (using pyjs or rapydscript - see https://medium.com/@ZeroDB_/zerodb-js-zerodb-in-the-browser-...)

Yes, I read it. But I'm not talking about the browser. What about IoT? What about mobile? Running Python on mobile? No one will do that; the majority of mobile devs consider it unprofessional. Some niche mobile apps in Python are cool, but you will never find a Python iOS developer, and you will never bundle Python into your app.

A protocol and a clean reference implementation without dependencies would be neat.

Actually, I think it's not too hard to get a Python library onto iOS/Android (especially Android). But I see your point.

A reference implementation without dependencies would just be a more cross-language rewrite of those dependencies, though, so it would take quite a while. The good news is that the authors of ZODB, which we use as a dependency, are willing to do that (and are already doing it) in a more cross-language way.

I read the whitepaper, and while I have many questions one detail in particular jumped out at me:

"The client authenticates with an X.509 certificate or a self-signed certificate where the private key is derived from a passphrase. When a passphrase is used, the scrypt algorithm is used to derive a private key from the passphrase. Information about the salt is stored server-side and is given before establishing an authenticated encrypted connection. ... The symmetric key is derived from either the passphrase, the private key in an X.509 certificate, or provided by a key management service (KMS)."
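For reference, the passphrase-to-key step described in the quote might look roughly like this with Python's stdlib scrypt (cost parameters here are illustrative, not ZeroDB's actual settings):

```python
import hashlib
import os

# Sketch of deriving a symmetric key from a passphrase via scrypt.
# n, r, p are illustrative cost parameters, not ZeroDB's.

def derive_key(passphrase: str, salt: bytes) -> bytes:
    # Memory-hard KDF: slows brute force far more than a bare hash would.
    return hashlib.scrypt(passphrase.encode(), salt=salt,
                          n=2**14, r=8, p=1, maxmem=2**26, dklen=32)

salt = os.urandom(16)  # per the paper, the salt is stored server-side
key = derive_key("correct horse battery staple", salt)

# Same passphrase + same salt => same key; different salt => different key.
assert key == derive_key("correct horse battery staple", salt)
assert key != derive_key("correct horse battery staple", os.urandom(16))
```

Note that because the server holds the salt, anyone with server access can start an offline dictionary attack against weak passphrases, which is the concern raised below.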

You're deriving a private key from a user-defined passphrase, then storing the salt used to derive the password on the server, which is defined to be untrusted. I understand you use scrypt to derive the private key, but this still seems like a massive, massive footgun. Database systems are usually kept up and running for a long time, like maybe even years. If the server gets the salt, isn't it only a matter of time before even a very slow brute-force attack can re-derive the authentication credential and make whatever queries it wants?

Deriving from a passphrase is optional: mainly for development and testing, and also for users who are OK with passphrases.

Having a passphrase plus a server-stored salt is better than a passphrase with no salt (against dictionary brute-force).

If you think about it, the security of the passphrase approach is tested all the time with Bitcoin brain wallets. There, private keys are derived using SHA256 (which is fast to compute!) with no salt. It's hard for humans to generate good passphrases, so those wallets generate them for users (usually 12 words).

But yeah, in most production use cases it would be a certificate or a KMS.

Yeah, this approach has been tested and has failed miserably: http://eprint.iacr.org/2016/103.pdf

I can't help but compare this system to Datomic. Effectively, what ZeroDB is calling the "database" is what Datomic calls a "storage service"; and what ZeroDB is calling the "client", Datomic calls the "transactor."

In both architectures, "blocks" of data get pulled from the storage service (database) to the transactor (client), which decodes blocks into rows of tables or indexes, does a query which possibly mutates those tables/indexes, generates new blocks from the mutated tables/indexes, and uploads them.

The difference is that ZeroDB's blocks are encrypted by the client, obviously—but I'm not sure that this is a semantic difference in the architectures of the two databases. I'm wondering if an equivalent change could be made to Datomic's transactor logic, without materially affecting the way Datomic works. (The question is basically how much of Datomic's required block metadata, necessary for finding the blocks on the storage service, leaks information which ZeroDB has managed to sequester inside the encrypted blocks instead.)

Huh, interesting comparison. Thanks for pointing that out! We'll read more about Datomic; maybe we can learn something useful from them.

BTW, you can hop in our Slack [http://slack.zerodb.io/] channel if you have any questions.

Being able to query/search/sort on encrypted data sounds really powerful and super relevant for this day and age.

I look forward to playing around with it.

Very cool MacLane!

SQL Server has Always Encrypted - https://msdn.microsoft.com/en-us/library/mt163865.aspx

That doesn't appear to use homomorphic encryption, though.

Yeah, that is actually deterministic encryption.

Microsoft had a very interesting research paper called CipherBase (we cited it), but I don't think they ended up commercializing that research.

OK, so why should I use ZeroDB when it seems to be using CryptDB for encrypted query processing (which is the USP of ZeroDB), given that CryptDB is fully open source, free, and verifiable?

ZeroDB isn't using CryptDB. And ZeroDB is open source: https://github.com/zero-db/zerodb

Thank you for the clarification. How are you computing over encrypted data? From the "acknowledgements" section of the paper, it seems to be the same technique as CryptDB. If so, where is this source code? (I'm assuming it to be in C/C++, in a "crypto" folder as in CryptDB.)

No, we don't use any techniques from CryptDB at all. We thought about doing server-side set intersections using some ideas from CryptDB, but decided against it.

In brief, we don't compute on encrypted data; we search for encrypted data using encrypted indexes (which are operated by the client, even though they are located remotely).


Guess I need to rename https://github.com/StabbyCutyou/0db

Also the name of a popular silent disco company in San Francisco.


Clearly not overlapping business though.

Yeah, right. I guess, as long as we don't redirect encrypted content of a database to a sound device, we should be all right :-)
