> Unlike CipherBase, our approach doesn’t leak information about the order of the elements. By observing access patterns, the server can only figure out which buckets are leaf nodes and which are branch nodes, but not whether the child node is located to the left or right of the parent node.
The paper correctly notes that when a client does a range query, it leaks information about the total number of blocks returned. However, if I've understood correctly, it also leaks the identities of those blocks. The closer two tree nodes are in order, the more likely they are to be touched by the same request. A passive adversary could use this to extract information about the relative ordering of encrypted nodes.
Also, active adversaries cause much worse problems. Let's say I have the ability to cause someone to do queries on the encrypted database, without observing the results. By simply doing separate index lookups on two different fields, I can extract a lot of information by looking at which nodes were touched by both queries.
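To make the intersection attack concrete, here's a minimal sketch (all names and bucket IDs are hypothetical; a real trace would be whatever opaque block identifiers the server observes):

```python
# Hypothetical sketch of the intersection attack described above.
# Each "trace" is the set of opaque bucket/block IDs the server saw the
# client fetch while answering one index lookup. The server never sees
# plaintext, only which encrypted blocks were requested.

def intersection_attack(trace_a, trace_b):
    """Blocks fetched by both lookups likely correspond to records
    matching both predicates (or shared upper-level index buckets)."""
    return set(trace_a) & set(trace_b)

# Example: lookups on two different indexed fields.
trace_email = {"b17", "b03", "b42", "b99"}   # index lookup on 'email'
trace_ssn   = {"b17", "b03", "b58", "b77"}   # index lookup on 'ssn'

shared = intersection_attack(trace_email, trace_ssn)
print(sorted(shared))  # → ['b03', 'b17']
```

The point is that the attack needs nothing but set intersection over access logs; no cryptanalysis is involved.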
Finally, I noticed that the paper doesn't describe how the B-trees are modified. If the client is the only party that knows how to decrypt nodes, then it's responsible for splitting and merging nodes to keep the tree balanced. Doesn't this require some kind of distributed coordination, so that multiple clients don't trample on each other's changes?
Whether an active server-side adversary can trick the client into doing certain queries is another question, though.
They are not as dangerous when queries are rare. But we are exploring ways to mitigate this attack vector completely using oblivious RAMs, which are becoming more and more performant.
You are also right about clients updating the B-trees. Currently we use object-level locks to keep data consistent. We may add storing per-bucket deltas server-side when a conflict appears, plus use additively homomorphic encryption for counters (the BTree.Length object).
Of course, active adversaries could hand back wrong results (wrong pieces of the index); we haven't yet considered in depth which problems that could cause.
But that's only true if you consider the database in isolation. It's a very bad assumption to make when using it as a component of a larger system!
Let's say you build a social networking service which uses ZeroDB to store its data; I'm an attacker who has access to the database server. There are all kinds of ways I can gain indirect control over the requests that users make to the service. What if I convince two different users to open a web page containing something like "<img src='https://supersecuresocialnetwork.com/friend-list.jsp'>"? I've just triggered an operation that will cause the app servers to request tree nodes corresponding to those users' accounts, and those of their friends, at predictable times. I can correlate the requests and determine with some level of confidence whether the users are mutual friends, all without being able to see a single byte of plaintext. If the service is publicly available and I can register my own accounts, that opens up another huge category of attacks.
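The correlation step in that scenario doesn't need anything fancy; a sketch (timestamps, window size, and bucket IDs all invented for illustration):

```python
# Hypothetical sketch: a server-side observer correlating bucket fetches
# with request times it controls (e.g. the <img> trick above). Inputs are
# only (timestamp, bucket_id) pairs; no plaintext is ever seen.

def buckets_near(access_log, t, window=0.5):
    """Bucket IDs fetched within `window` seconds of time `t`."""
    return {bid for (ts, bid) in access_log if abs(ts - t) <= window}

# access_log: what the storage server observes
access_log = [
    (10.1, "b7"), (10.2, "b12"), (10.2, "b31"),   # user A's friend list loads
    (25.0, "b7"), (25.1, "b12"), (25.2, "b64"),   # user B's friend list loads
]

# If the attacker triggered the two page loads at t≈10 and t≈25, the
# overlap hints that both requests touched the same index buckets.
overlap = buckets_near(access_log, 10.2) & buckets_near(access_log, 25.1)
print(sorted(overlap))  # → ['b12', 'b7']
```

With enough repeated, attacker-timed requests, the confidence of such correlations only goes up.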
My point is that you're making an awful lot of assumptions about what strategy an attacker will use. If you don't have provable security against an entire class of attacks (e.g. cryptographic properties like IND-CPA) then the burden is on you to be more imaginative than whoever is trying to compromise your system.
Very recently, Baum et al. developed a publicly auditable secure MPC system that ensures correctness, even when all computing nodes are covertly malicious, or all but a single node are actively malicious. Their state-of-the-art results are based on a variation of SPDZ (pronounced speedz) and depend on a public append-only bulletin board, which stores the trail of each computation. This allows any auditing party to check the output is correct by comparing it to the public ledger’s trail of proofs. Our system uses the blockchain as the bulletin board, thus our overall security is reduced to that of the hosting blockchain.
I'm not a db or crypto expert, but this seems like a promising solution to the active adversary problem if you have an honest, healthy blockchain that is incentivized correctly.
Looks like Enigma would have challenges in many practical environments (mainly because it involves a lot of small interactions between nodes, so it should be affected by latency). So it's more like a way to build an "Ethereum of private computations".
But, as you pointed out, it could offer good clues on how to solve the problem of active adversaries!
I raised some questions there related to your last concerns. I'm really interested to go through the current state and see how things have evolved to mitigate them.
I guess that this would be the next-best thing after FHE?
The catch is the performance, right? I mean, the paper says:
"In order to make the database end-to-end encrypted yet still capable of performing queries, the client encrypts the buckets (at the time of creation or modification). The server, which stores the buckets, never knows the encryption key used. The objects referenced by the leaf nodes of the B-Tree indexes are also encrypted client-side. As a result, the server doesn’t know how individual objects are organized within the B-Tree or whether they belong to an index at all. Since ZeroDB is encryption agnostic, probabilistic encryption can ensure that the server cannot even compare objects for equality.

When a client performs a query, it asks the server to return buckets of the tree as it traverses the index remotely and incrementally, as in Fig. 2b. The client fetches and decrypts buckets in order to figure out which buckets in the next level of the tree to fetch next. Frequently accessed buckets can be cached client-side so that subsequent queries do not make unnecessary network calls."
Which would suggest that it might be a bit slow to do many round-trips for each query.
I'm guessing there could be some specific use-cases where this is not so relevant?
I would love to have some more knowledgeable people comment on this as it would be really neat to be able to start using a DB with these features.
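For what it's worth, the remote traversal the quote describes boils down to one fetch per tree level. A rough sketch, assuming hypothetical `fetch_bucket`/`decrypt` callables and a toy node layout (none of this is ZeroDB's actual API):

```python
# Minimal sketch of client-driven B-tree traversal: the client fetches
# one encrypted bucket per level, decrypts it locally, and decides which
# child bucket to fetch next. Round-trips ~ tree depth, i.e. O(log_B n).

def lookup(key, root_id, fetch_bucket, decrypt, cache=None):
    """Descend the remote tree; returns (value, round_trips)."""
    cache = {} if cache is None else cache
    node_id, round_trips = root_id, 0
    while True:
        if node_id in cache:                       # client-side bucket cache
            node = cache[node_id]
        else:
            node = decrypt(fetch_bucket(node_id))  # one network round-trip
            cache[node_id] = node
            round_trips += 1
        if node["leaf"]:
            return node["items"].get(key), round_trips
        # pick the child whose key range covers `key`
        node_id = next(cid for lo, hi, cid in node["children"]
                       if lo <= key < hi)

# demo with a tiny in-memory "server" (plaintext stand-ins for ciphertext)
tree = {
    "root": {"leaf": False, "children": [(0, 50, "L"), (50, 100, "R")]},
    "L":    {"leaf": True, "items": {10: "alice", 20: "bob"}},
    "R":    {"leaf": True, "items": {60: "carol"}},
}
print(lookup(60, "root", tree.__getitem__, lambda b: b))  # → ('carol', 2)
```

So the latency cost is round-trips-per-level, which the client cache amortizes for hot paths near the root.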
>Which would suggest that it might be a bit slow to do many round-trips for each query.
I was curious and found some information through a search (key point italicized):
"On the performance aspect, with a real world use case of 1GB index, just 150KB of data must be transferred on average over three requests to fetch the results back. In full text search terms, 250MB of data can be queried in around 500msec which even though slow, may not be prohibitively slow for some use cases. Insert queries also may take around the same time. The number of requests needed to fetch the query results grows logarithmically with the data size."
>I'm guessing there could be some specific use-cases where this is not so relevant?
They claim the performance degradation is not much, but that would become worse as the number of clients grows. The clients also would need to have a relatively decent amount of RAM and CPU power to handle this.
CryptDB from MIT - http://css.csail.mit.edu/cryptdb/
Our performance has become a little better since then. We'll also update our full-text search algorithm to use Lucene's practical scoring function, which makes it more scalable and will also help with performance.
As the number of clients grows, things actually get better for read queries (as long as we support MVCC). However, multiple writing clients can create congestion if they write data to the same place, which can indeed affect performance.
I ask because apparently there would be a practical limit on the size of the DB.
So considering that you would probably need to restrict the DB use to something specific, what would that be? (at least while performance gets better/practical for very big datasets)
When users have their own private datasets, though, it's fine to scale up the number of users.
This limitation comes from a) invalidation requests and b) the limited ability to do server-side conflict resolution.
A protocol spec and a clean reference implementation without dependencies would be neat.
A reference implementation without dependencies would really just be a more cross-language rewrite of those dependencies, though, so it would take quite a while. The good news is that the authors of ZODB, which we use as a dependency, are willing to do that (and are already doing it) in a more cross-language way.
"The client authenticates with an X.509 certificate or a self-signed certificate where the private key is derived from a passphrase. When a passphrase is used, the scrypt algorithm is used to derive a private key from the passphrase. Information about the salt is stored server-side and is given before establishing an authenticated encrypted connection. ... The symmetric key is derived from either the passphrase, the private key in an X.509 certificate, or provided by a key management service (KMS)."
You're deriving a private key from a user-defined passphrase, then storing the salt used to derive the password on the server, which is defined to be untrusted. I understand you use scrypt to derive the private key, but this still seems like a massive, massive footgun. Database systems are usually kept up and running for a long time, like maybe even years. If the server gets the salt, isn't it only a matter of time before even a very slow brute-force attack can re-derive the authentication credential and make whatever queries it wants?
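For reference, the derivation step in the quote amounts to something like the following, using the scrypt KDF from Python's standard library (the cost parameters `n`, `r`, `p` here are illustrative defaults, not necessarily what ZeroDB uses):

```python
# Sketch of passphrase -> key derivation with a server-stored salt.
import hashlib
import os

def derive_key(passphrase: str, salt: bytes) -> bytes:
    """Derive a 256-bit key from a passphrase with scrypt.
    n/r/p are assumed cost parameters (16 MiB memory, sequential)."""
    return hashlib.scrypt(passphrase.encode("utf-8"), salt=salt,
                          n=2**14, r=8, p=1, dklen=32)

salt = os.urandom(16)       # per the paper, the salt lives server-side
key = derive_key("correct horse battery staple", salt)
assert len(key) == 32       # 256-bit symmetric key
```

Note that the salt only defeats precomputed/rainbow-table attacks; as you say, once the server holds the salt, a low-entropy passphrase is just a (slowed-down) brute-force away, and scrypt's cost factor buys time, not safety.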
Having a passphrase plus a server-stored salt is still better than a passphrase with no salt (against dictionary brute force).
If you think about it, the security of the passphrase approach is tested all the time with Bitcoin brain wallets. There, private keys are derived using SHA256 (which is fast to compute!) with no salt. It's hard for humans to generate good passphrases, so these wallets generate them for users (usually 12 words).
But yeah, in most production use cases it would be a certificate or a KMS.
In both architectures, "blocks" of data get pulled from the storage service (database) to the transactor (client), which decodes blocks into rows of tables or indexes, does a query which possibly mutates those tables/indexes, generates new blocks from the mutated tables/indexes, and uploads them.
The difference is that ZeroDB's blocks are encrypted by the client, obviously—but I'm not sure that this is a semantic difference in the architectures of the two databases. I'm wondering if an equivalent change could be made to Datomic's transactor logic, without materially affecting the way Datomic works. (The question is basically how much of Datomic's required block metadata, necessary for finding the blocks on the storage service, leaks information which ZeroDB has managed to sequester inside the encrypted blocks instead.)
I look forward to playing around with it.
Very cool MacLane!
Microsoft had a very interesting research paper called CipherBase (we cited it), but I think they didn't end up commercializing that research.
In brief, we don't compute on encrypted data; we search for encrypted data using encrypted indexes (which are operated from the client, even though they are located remotely).
Guess I need to rename https://github.com/StabbyCutyou/0db
Clearly not overlapping business though.