
ZeroDB white paper: An end-to-end encrypted database - mwilkison
http://arxiv.org/abs/1602.07168
======
teraflop
Interesting concept. I disagree with this statement from the "Data
confidentiality" section:

> Unlike CipherBase [6], our approach doesn’t leak information about the order
> of the elements. By observing access patterns, the server can only figure
> out which buckets are leaf nodes and which are branch nodes, but not whether
> the child node is located to the left or right of the parent node.

The paper correctly notes that when a client does a range query, it leaks
information about the total number of blocks returned. However, if I've
understood correctly, it _also_ leaks the identities of those blocks. The
closer two tree nodes are in order, the more likely they are to be touched by
the same request. A passive adversary could use this to extract information
about the relative ordering of encrypted nodes.

Also, active adversaries cause much worse problems. Let's say I have the
ability to cause someone to do queries on the encrypted database, without
observing the results. By simply doing separate index lookups on two different
fields, I can extract a lot of information by looking at which nodes were
touched by both queries.

Finally, I noticed that the paper doesn't describe how the B-trees are
modified. If the client is the only party that knows how to decrypt nodes,
then it's responsible for splitting and merging nodes to keep the tree
balanced. Doesn't this require some kind of distributed coordination, so that
multiple clients don't trample on each other's changes?

~~~
michwill
Also you cannot do queries on the encrypted db without being able to observe
the results. E.g. if you are able to initiate a query on a fraction of the
data, you already have access to that fraction of the data. And you cannot
initiate a meaningful query if you are a server-side observer.

Of course, active adversaries could give wrong results (wrong pieces of the
index) back, we didn't quite consider yet which problems that could cause.

~~~
roymurdock
Is there a reason that you did not cite Enigma in your list of notable works?
They solve the adversary problem by splitting tasks among a network of nodes
that post their intermediate work to a blockchain where it can be verified for
accuracy:

 _Very recently, Baum et al. developed a publicly auditable secure MPC system
that ensures correctness, even when all computing nodes are covertly
malicious, or all but a single node are actively malicious [18]. Their state-
of-the-art results are based on a variation of SPDZ (pronounced speedz) [19]
and depend on a public append-only bulletin board, which stores the trail of
each computation. This allows any auditing party to check the output is
correct by comparing it to the public ledger’s trail of proofs. Our system
uses the blockchain as the bulletin board, thus our overall security is
reduced to that of the hosting blockchain._

I'm not a db or crypto expert, but this seems like a promising solution to the
active adversary problem if you have an honest, healthy blockchain that is
incentivized correctly.

~~~
michwill
Yeah, actually that is true. Enigma worths mentioning as one of MPC
references.

Looks like Enigma would have challenges in many practical environments (mainly
because it involves a lot of small interactions between nodes, so should be
affected by latency). So, it's more like a way to build "Etherium of private
computations".

But, as you pointed out, it can have good clues how to solve the problem of
active adversaries!

------
saganus
This looks very interesting.

I guess that this would be the next-best thing after FHE?.

The catch is the performance, right? I mean, the paper says:

"In order to make the database end-to-end encrypted yet still capable of
performing queries, the client encrypts the buckets (at the time of creation
or modification). The server, which stores the buckets, never knows the
encryption key used. The objects referenced by the leaf nodes of the B-Tree
indexes are also encrypted client-side. As a result, the server doesn’t know
how individual objects are organized within the B-Tree or whether they belong
to an index at all. Since ZeroDB is encryption agnostic, probabilistic
encryption can ensure that the server cannot even compare objects for
equality. _When a client performs a query, it asks the server to return
buckets of the tree as it traverses the index remotely and incrementally_ , as
in Fig. 2b. The client fetches and decrypts buckets in order to figure out
which buckets in the next level of the tree to fetch next. Frequently accessed
buckets can be cached client-side so that subsequent queries do not make
unnecessary network calls."

Which would suggest that it might be a bit slow to do many round-trips for
each query.

I'm guessing there could be some specific use-cases where this is not so
relevant?

I would love to have some more knowledgeable people comment on this as it
would be really neat to be able to start using a DB with these features.

~~~
newscracker
> _When a client performs a query, it asks the server to return buckets of the
> tree as it traverses the index remotely and incrementally, as in Fig. 2b.
> The client fetches and decrypts buckets in order to figure out which buckets
> in the next level of the tree to fetch next. Frequently accessed buckets can
> be cached client-side so that subsequent queries do not make unnecessary
> network calls. "_

>Which would suggest that it might be a bit slow to do many round-trips for
each query.

I was curious and found some information through a search (key point
italicized) [1]: "On the performance aspect, with a real world use case of 1GB
index, just 150KB of data must be transferred on average over three requests
to fetch the results back. In full text search terms, 250MB of data can be
queried in around 500msec which even though slow, may not be prohibitively
slow for some use cases. Insert queries also may take around the same time.
_The number of requests needed to fetch the query results grows
logarithmically with the data size._ "

[1]: [http://www.infoq.com/news/2015/04/Encrypt-Database-
CryptDB-Z...](http://www.infoq.com/news/2015/04/Encrypt-Database-CryptDB-
ZeroDB)

>I'm guessing there could be some specific use-cases where this is not so
relevant?

They claim the performance degradation is not much, but that would become
worse as the number of clients grows. The clients also would need to have a
relatively decent amount of RAM and CPU power to handle this.

Related: CryptDB from MIT -
[http://css.csail.mit.edu/cryptdb/](http://css.csail.mit.edu/cryptdb/)

~~~
michwill
Yep, relevant pieces, thanks for bringing this!

Our performance become a little better since then. And also we'll update our
fulltext search algorithm using Lucene's practical scoring function: that
makes it more scalable and will also help with performance.

As number of clients grows, things actually become better for read queries (as
long as we support MVCC). However, multiple writing clients can create
congestion if they write data to the same place which can affect the
performance indeed.

~~~
saganus
Do you have a specific use case in mind for this?

I ask because apparently there would be a practical limit on the size of the
DB.

So considering that you would probably need to restrict the DB use to
something specific, what would that be? (at least while performance gets
better/practical for very big datasets)

~~~
michwill
I don't think it's so much of a size issue. I'd use it for applications where
you don't have too many simultaneous users of the same dataset (in future -
that constraint would be only about simultaneous writing users).

When users have their own private datasets, that's ok to scale up number of
users though.

This limitation comes from a) invalidation requests and b) limited ability to
do server-side conflict resolution

------
augustl
Seems that this initial release only supports python. The client is obviously
heavy, since it has to run all the queries, etc. Thankfully the protocol is
well defined, so it's just a question of effort to support more languages.

[https://docs.zerodb.io/](https://docs.zerodb.io/)

~~~
ex3ndr
They are our friends and i continuously asking them to make open standard
instead of implementing everything on python and python database and make it
work with psql or cassandra. I hope they will release something for this as we
are wish to include zerodb to our encrypted messaging (actor.im).

~~~
michwill
In fact, the approach you used for actor (using GWT) is no less applicable to
us (using pyjs or rapydscript - see [https://medium.com/@ZeroDB_/zerodb-js-
zerodb-in-the-browser-...](https://medium.com/@ZeroDB_/zerodb-js-zerodb-in-
the-browser-ffe013dca6f0#.fjtonf6u2))

~~~
ex3ndr
Yes, i read it. But i am not talking about browser. What about IoT? What about
mobile? Running python in mobile? No one will do this as it is just non-
professional for majority of mobile devs. Some niche mobile apps on python is
cool, but you will never find python iOS developer and you will never bundle
python to your app.

Protocol and some clean reference implementation without dependencies will be
neat.

~~~
michwill
Actually, I think it's not too hard to get a python library to iOS/Android
(especially Android). But I see your point.

Reference implementation without dependencies would be just a more cross-
language rewrite of these dependencies though, so would take quite a while.
Good news is that authors of ZODB which we use as a dependency are willing to
do that (and already do that) in a more cross-language way.

------
swordswinger12
I read the whitepaper, and while I have many questions one detail in
particular jumped out at me:

"The client authenticates with an X.509 certificate or a self-signed
certificate where the private key is derived from a passphrase. When a
passphrase is used, the scrypt algorithm is used to derive a private key from
the passphrase. Information about the salt is stored server-side and is given
before establishing an authenticated encrypted connection. ... The symmetric
key is derived from either the passphrase, the private key in an X.509
certificate, or provided by a key management service (KMS)."

You're deriving a private key from a user-defined passphrase, then storing the
salt used to derive the password _on the server_ , which is defined to be
untrusted. I understand you use scrypt to derive the private key, but this
still seems like a massive, massive footgun. Database systems are usually kept
up and running for a long time, like maybe even years. If the server gets the
salt, isn't it only a matter of time before even a very slow brute-force
attack can re-derive the authentication credential and make whatever queries
it wants?

~~~
michwill
Deriving from passphrase is optional. Mainly, for development and testing, and
also for users who are ok with passphrases.

Having a passphrase + server-saved salt makes it better than a passphrase with
no salt (for brute-force by dictionary).

If you think about it, security of passphrase approach is tested all the time
with Bitcoin brain wallets. Over there, private keys are derived using SHA256
(which is fast to compute!) with no salt. It's hard for humans to generate
good passphrases, so these wallets do that for them (usually 12 words).

But yeah, in most production usecases it would be some certificate, or KMS

~~~
swordswinger12
Yeah, this approach has been tested and has failed miserably:
[http://eprint.iacr.org/2016/103.pdf](http://eprint.iacr.org/2016/103.pdf)

------
derefr
I can't help but compare this system to Datomic. Effectively, what ZeroDB is
calling the "database" is what Datomic calls a "storage service"; and what
ZeroDB is calling the "client", Datomic calls the "transactor."

In both architectures, "blocks" of data get pulled from the storage service
(database) to the transactor (client), which decodes blocks into rows of
tables or indexes, does a query which possibly mutates those tables/indexes,
generates new blocks from the mutated tables/indexes, and uploads them.

The difference is that ZeroDB's blocks are encrypted by the client,
obviously—but I'm not sure that this is a _semantic_ difference in the
architectures of the two databases. I'm wondering if an equivalent change
could be made to Datomic's transactor logic, without materially affecting the
way Datomic works. (The question is basically how much of Datomic's required
block _metadata_ , necessary for finding the blocks on the storage service,
leaks information which ZeroDB has managed to sequester inside the encrypted
blocks instead.)

~~~
michwill
Huh, interesting comparison. Thanks for pointing that out! We'll read more
about Datomic: maybe we can learn something useful from them

------
mwilkison
BTW, you can hop in our Slack
[[http://slack.zerodb.io/](http://slack.zerodb.io/)] channel if you have any
questions.

------
corbinpage
Being able to query/search/sort on encrypted data sounds really powerful and
super relevant for this day and age.

I look forward to playing around with it.

Very cool MacLane!

------
0xFFFF0000
SQL Server has Always Encrypted - [https://msdn.microsoft.com/en-
us/library/mt163865.aspx](https://msdn.microsoft.com/en-
us/library/mt163865.aspx)

~~~
sigmar
That doesn't appear to use homomorphic encryption though

~~~
michwill
Yeah, that is actually deterministic encryption.

Microsoft had a very interesting research paper called CipherBase (we cited
it), but I think they didn't end up commercializing that research

------
mankash666
OK - So why should I use ZeroDb when it seems to be using Crypt-DB for
encrypted query processing [which is the USP of ZeroDb] when cryptDb is fully
open source, free and verifiable?

~~~
mwilkison
ZeroDB isn't using CryptDB. And ZeroDB is open source:
[https://github.com/zero-db/zerodb](https://github.com/zero-db/zerodb)

~~~
mankash666
Thank you for the clarification. How are you computing over encrypted data?
From the "acknowledgements" section of the paper, it seems to be the same
technique as cryptDB. If so, where is this source code? [I'm assuming this to
be in C/C++] in the "crypto" folder as in cryptDb

~~~
michwill
No, we don't use any techniques from CryptDB at all. We thought about doing
server-side set intersections using some ideas from CryptDB but we decided to
not do that.

In brief, we don't compute on encrypted data, we search for encrypted data
using encrypted indexes (which are operated from the client, even though are
located remotely).

------
StabbyCutyou
Well...

Guess I need to rename
[https://github.com/StabbyCutyou/0db](https://github.com/StabbyCutyou/0db)

------
crystaln
Also the name of a popular silent disco company in San Francisco:

[https://www.facebook.com/ZEROdBsf/](https://www.facebook.com/ZEROdBsf/)

Clearly not overlapping business though.

~~~
michwill
Yeah, right. I guess, as long as we don't redirect encrypted content of a
database to a sound device, we should be all right :-)

