
Building Searchable Encrypted Databases with PHP and SQL - CiPHPerCoder
https://paragonie.com/blog/2017/05/building-searchable-encrypted-databases-with-php-and-sql
======
tyingq
Can someone explain why it's not simple to brute force the hash in the blind
index? Since SSNs only have 9 digits, that's a billion possibilities. Is this
just a hash that's too slow to iterate over the billion? Is each row salted
differently?

Edit: I'm asking because I don't know, not because I'm saying it's possible.

Edit2: Also, it appears there was a pattern to SSN issuance prior to 2011, so
the problem space may be far short of a billion. No numbers start with 00 for
example. Also, there's a listing of numbers not ever issued[1], and a table of
"highest assigned"[2]. Since SSN's issued after 2011 are children now, they
can be safely skipped, as they aren't as interesting for fraud. Also, if the
State of residence is cleartext, you can search that space first if you assume
a large number of people live in their birth state.

[1][https://www.ssa.gov/employer/stateweb.htm](https://www.ssa.gov/employer/stateweb.htm)

[2][https://www.ssa.gov/employer/ssnvhighgroup.htm](https://www.ssa.gov/employer/ssnvhighgroup.htm)

~~~
slv77
HMAC is a keyed hash which renders simple brute force attacks ineffective
unless the attacker is also able to compromise the key.

Even without the key the system is still vulnerable to frequency analysis. For
example if it were used to store passwords an simple sort and count would
reveal accounts that have common passwords which could then be brute forced
using lists of the most common passwords found on the internet. The suggestion
of using blind indexes for searching on partial values (i.e. the first letter
of a name) would increase the vulnerability.

Even if the attacker doesn't have access to the key if the application is fast
enough he may be able to brute force it by passing every possible value and
seeing how it encrypts if the search space is small enough (i.e. common
passwords).

Secure key storage is always a challenge and the suggestion that the key is
stored on the application server instead of the database has some practical
problems. The key in many cases will end up being duplicated in code
repositories or build scripts which increases your vulnerability to internal
attackers. A better system stored half the key on the database server and the
other half on the application server which are then XOR'd to get the
encryption keys which can reduce the number of potential internal attackers
that could recover the plaintext.

Also when recovering from an attack in many cases it may be difficult or
impossible to know how many systems were compromised. Without the ability to
prove that both the data and the key weren't compromised in an attack an
auditor may require that you operate on the assumption that the data was
compromised with associated requirements for notification and penalties.

Banks long ago solved the problem of secure key storage with the use of
hardware security modules (HSMs) where the key is stored and used on a secure,
tamper proof device. This turned the problem of secure key storage into a
problem of physical security which the banks were already good at rather than
digital security. Where is my key? My key is right here.

These modules are now becoming more commonly used in the ecommerce space (AWS
now provides them, for example) but they are still fairly difficult to work
with. Some shops will used hardened/audited servers that run
encryption/decryption micro-services instead.

~~~
CiPHPerCoder
> Even without the key the system is still vulnerable to frequency analysis.
> For example if it were used to store passwords an simple sort and count
> would reveal accounts that have common passwords which could then be brute
> forced using lists of the most common passwords found on the internet.

I would hope, very strongly, that nobody would _encrypt_ passwords, but
instead, hash them: [https://paragonie.com/blog/2016/02/how-safely-store-
password...](https://paragonie.com/blog/2016/02/how-safely-store-password-
in-2016)

------
pg_bot
I disagree with the dishonorable mention as you've only considered the naive
case which you can build upon to avoid the issues mentioned. Consider a design
where everything is AES-256 encrypted in the application and the database just
receives an encrypted binary. You can build an in-memory cache for situations
where you need to search sensitive information (first name, last name, ssn) by
using a tokenizer and using a technology like Erlang term storage. It will be
eventually consistent which should be fine for our use case since that type of
information does not change that often and the cache recompute can be done
asynchronously.

From a security standpoint you don't change the threat model for the
application. If someone gets unauthorized access to the server you are already
screwed since your application can get access to the plain text data in order
to serve requests.

If you threw away that computation for every request I would agree that it is
wasteful but we don't need to do that if we rethink the role of the database
in our application.

~~~
CiPHPerCoder
> I disagree with the dishonorable mention as you've only considered the naive
> case which you can build upon to avoid the issues mentioned. Consider a
> design where everything is AES-256 encrypted in the application and the
> database just receives an encrypted binary. You can build an in-memory cache
> for situations where you need to search sensitive information (first name,
> last name, ssn) by using a tokenizer and using a technology like Erlang term
> storage. It will be eventually consistent which should be fine for our use
> case since that type of information does not change that often and the cache
> recompute can be done asynchronously.

[https://tonyarcieri.com/all-the-crypto-code-youve-ever-
writt...](https://tonyarcieri.com/all-the-crypto-code-youve-ever-written-is-
probably-broken)

[https://blog.filippo.io/the-ecb-penguin/](https://blog.filippo.io/the-ecb-
penguin/)

There are defensible positions for which ECB mode is acceptable. I wouldn't
classify the scenario you described as one of them.

> If someone gets unauthorized access to the server you are already screwed
> since your application can get access to the plain text data in order to
> serve requests.

If someone gets unauthorized access to the _webserver_ , yes, it's game over.

If someone gets unauthorized access to the _database server_ , and it's a
separate machine from the _webserver_ , then your analysis here isn't
applicable.

That's the threat model alluded to in the beginning of the article, but if
it's not _your_ threat model, then you may reach different conclusions about
the best way to protect data. (You may not even use encryption, for example.)

------
johnbro
If I understand this correctly, you basically store HMAC or password hash as a
separate field and index on it? That may work for SSNs (which are unique), but
it won't work for other things you commonly need to encrypt, such as
personally identifiable data (in healthcare scenarios).

If you (for example) encrypt first names of people (or any other data point
that is not unique per entry) using this scheme, then HMAC will reveal all the
rows that have the same first name. You can then use frequency analysis to
determine with high probablity what the encrypted names are.

~~~
CiPHPerCoder
> If you (for example) encrypt first names of people (or any other data point
> that is not unique per entry) using this scheme, then HMAC will reveal all
> the rows that have the same first name. You can then use frequency analysis
> to determine with high probablity what the encrypted names are.

This is where things get difficult to explain, because most health care
programs are going to care about compliance first and foremost. So with that
in mind:

1\. You probably don't need to encrypt their first name to be e.g. HIPAA
compliant, but...

2\. Using a very short Bloom filter increases the odds of false positive
collisions. Combine this with Argon2 and aggressive rate limiting, and now
you've frustrated frequency analysis and chosen plaintext attacks greatly.

Given the threat model that we've given (database is not the same machine as
the webserver, and the database server is what gets compromised), I can't see
a better solution.

------
big_spammer
How is this different than CryptDB?

[https://css.csail.mit.edu/cryptdb/](https://css.csail.mit.edu/cryptdb/)

~~~
CiPHPerCoder
From what /r/crypto says, they leak metadata to the database server.

[https://www.reddit.com/r/crypto/comments/6e5jjz/building_sea...](https://www.reddit.com/r/crypto/comments/6e5jjz/building_searchable_encrypted_databases_with_php/di8n26v/?context=2)

------
cdevs
I know this is a I'm going quest and for that reason it annoys me when i see
my companies competitors say they encrypt all their users searchable data and
I know what they are really doing is using Amazon disk level encryption that's
not really helpful if your live database gets dumped.

------
kakoni
Here's a (naive) ruby implementation.
[https://gist.github.com/kakoni/399e5324611db9f89806172124549...](https://gist.github.com/kakoni/399e5324611db9f89806172124549aa7)

------
Quiark
I agree that the HMAC approach is better but a proper deterministic encryption
would work just fine, wouldn't it? You're comparing your method with ECB mode
which is a bit unfair.

~~~
CiPHPerCoder
What are some constructions you'd consider "fair"?

~~~
Quiark
AEAD cipher with nonce generated from plaintext? That would have equivalent
security properties except for being needlessly complicated?

~~~
CiPHPerCoder
So you mean something like this?

    
    
        $nonce = sodium_crypto_generichash($message, '', 24);
        $ciphertext = $nonce . sodium_crypto_aead_xchacha20poly1305_ietf_encrypt(
            $message,
            $nonce,
            $nonce,
            $key
        );
    

That solves the literal search problem, but none of the others.

However, in my experience[1], when it comes to deterministic encryption, most
PHP developers will use ECB mode, or CBC mode with a NULL IV.

[1]:
[https://meta.stackoverflow.com/q/293930/2224584](https://meta.stackoverflow.com/q/293930/2224584)

------
piotrkaminski
Why not just use AES-SIV instead of separate encrypted and HMAC fields?

~~~
CiPHPerCoder
The practical reason: [https://3v4l.org/LEar1](https://3v4l.org/LEar1) (No SIV
in that list.) It's also not present in libsodium or ( _shudder_ ) mcrypt.

What property are you hoping to extract out of AES-SIV in particular? (And
would AEZ, HS1-SIV, AES-SIV-GCM, etc. solve your problem?)

From what I understand, SIV mode still uses a nonce, but it doesn't explode if
you reuse it for different messages. Are you aiming for nonce-less encryption?

The end goal that was being worked towards in the article (Argon2-derived
Bloom filters for partial searching) is a little more useful than nonce-less
encryption.

~~~
piotrkaminski
I was specifically after a deterministic encryption scheme that would allow
for exact-match comparisons and whose deployment was fairly foolproof (so a
non-expert couldn't screw it up too badly), which AES-SIV appears to provide
nicely. Another benefit in my case is that I'm using a NoSQL database where
separating the encrypted value from its indexed value is not always practical,
so having a single value that serves both purposes is highly desirable.

I'm not familiar with the other schemes you mentioned, but the practical
reason for using AES-SIV is the converse of yours for not using it: there's an
implementation for JavaScript. :) It's unlikely there are implementations for
the other schemes you listed.

I agree that the Bloom filters are a nice natural outcome of your approach,
though it would also be possible to combine AES-SIV for primary encryption
with a truncated HMAC for partial match Bloom filters.

~~~
CiPHPerCoder
Here's the killer feature of my approach: Each filter has its own HMAC/Argon2
key, so it becomes less useful for what I like to call crossword puzzle
attacks. (I hope that's easier to understand than a wordy explanation of how
such an attack would work.)

------
ge96
yay PHP/SQL (stack I work with, well MySQL)

