Hacker News new | comments | ask | show | jobs | submit login
Building Searchable Encrypted Databases with PHP and SQL (paragonie.com)
59 points by CiPHPerCoder on May 31, 2017 | hide | past | web | favorite | 30 comments



Can someone explain why it's not simple to brute force the hash in the blind index? Since SSNs only have 9 digits, that's a billion possibilities. Is this just a hash that's too slow to iterate over the billion? Is each row salted differently?

Edit: I'm asking because I don't know, not because I'm saying it's possible.

Edit2: Also, it appears there was a pattern to SSN issuance prior to 2011, so the problem space may be far short of a billion. No numbers start with 00 for example. Also, there's a listing of numbers not ever issued[1], and a table of "highest assigned"[2]. Since SSN's issued after 2011 are children now, they can be safely skipped, as they aren't as interesting for fraud. Also, if the State of residence is cleartext, you can search that space first if you assume a large number of people live in their birth state.

[1]https://www.ssa.gov/employer/stateweb.htm

[2]https://www.ssa.gov/employer/ssnvhighgroup.htm


HMAC is a keyed hash which renders simple brute force attacks ineffective unless the attacker is also able to compromise the key.

Even without the key the system is still vulnerable to frequency analysis. For example if it were used to store passwords an simple sort and count would reveal accounts that have common passwords which could then be brute forced using lists of the most common passwords found on the internet. The suggestion of using blind indexes for searching on partial values (i.e. the first letter of a name) would increase the vulnerability.

Even if the attacker doesn't have access to the key if the application is fast enough he may be able to brute force it by passing every possible value and seeing how it encrypts if the search space is small enough (i.e. common passwords).

Secure key storage is always a challenge and the suggestion that the key is stored on the application server instead of the database has some practical problems. The key in many cases will end up being duplicated in code repositories or build scripts which increases your vulnerability to internal attackers. A better system stored half the key on the database server and the other half on the application server which are then XOR'd to get the encryption keys which can reduce the number of potential internal attackers that could recover the plaintext.

Also when recovering from an attack in many cases it may be difficult or impossible to know how many systems were compromised. Without the ability to prove that both the data and the key weren't compromised in an attack an auditor may require that you operate on the assumption that the data was compromised with associated requirements for notification and penalties.

Banks long ago solved the problem of secure key storage with the use of hardware security modules (HSMs) where the key is stored and used on a secure, tamper proof device. This turned the problem of secure key storage into a problem of physical security which the banks were already good at rather than digital security. Where is my key? My key is right here.

These modules are now becoming more commonly used in the ecommerce space (AWS now provides them, for example) but they are still fairly difficult to work with. Some shops will used hardened/audited servers that run encryption/decryption micro-services instead.


> Even without the key the system is still vulnerable to frequency analysis. For example if it were used to store passwords an simple sort and count would reveal accounts that have common passwords which could then be brute forced using lists of the most common passwords found on the internet.

I would hope, very strongly, that nobody would encrypt passwords, but instead, hash them: https://paragonie.com/blog/2016/02/how-safely-store-password...


> Can someone explain why it's not simple to brute force the hash in the blind index?

You're not bruteforcing H(m) here. You're trying to bruteforce H(m, k) without knowing k. To make matters worse, k is a random string of 128+ bits, generated by a CSPRNG.


Hrm. Ok. This is assuming I've compromised the database client side, right? Doesn't the client have to know 'k' to do the query? k is $indexKey, right?

Edit: Thanks, got it now. A database dump on it's own gives nothing useful. The client side is vulnerable, but has to be since it's ultimately serving up plaintext SSNs anyway.


This is assuming that:

  - The webserver and database server are on separate bare-metal
  - The database gets compromised, not the webserver


Very big assumptions, vector of attacks tend to begin at the webserver more so than the DB. Sometimes, these DB dumps we hear off occurred over HTTP!


Right, but if you can compromise the webserver, you can get the key and defeat any application-layer encryption a PHP developer would have access to, thereby rendering any other threat models uninterestingly broken.

If you can keep the webserver secure, and on separate hardware from the database, you can protect against some attacks rather than no attacks. And if your database server is used by multiple verticals within a single company, this is an even more defensible design decision to make.


Ah, the multiple verticals angle is where this starts to make sense to me. This app might be well-secured, but the dept next door might have something worse, but this way if the DB's compromised via theirs, the data's safe, I guess?


Right, so if you are rolling your own, even better is to have an API that sits between the web server and the DB and handles the decrypting and encryption. In that middleware you can now add things such as rate limiting of data requests, logging, authorization, etc.


Seems fair enough to me, though the article could do with a lead-in that explains what's being protected, the premise that the database server is a separate machine, etc.

Protecting the client side would be a completely different approach. Outboard/upstream tokenization or something. If it's directly serving up the sensitive data, there's no real way to leverage encryption for protection there.


> Seems fair enough to me, though the article could do with a lead-in that explains what's being protected, the premise that the database server is a separate machine, etc.

Okay, I've added a section that spells this out explicitly and unambiguously. https://paragonie.com/blog/2017/05/building-searchable-encry...


I disagree with the dishonorable mention as you've only considered the naive case which you can build upon to avoid the issues mentioned. Consider a design where everything is AES-256 encrypted in the application and the database just receives an encrypted binary. You can build an in-memory cache for situations where you need to search sensitive information (first name, last name, ssn) by using a tokenizer and using a technology like Erlang term storage. It will be eventually consistent which should be fine for our use case since that type of information does not change that often and the cache recompute can be done asynchronously.

From a security standpoint you don't change the threat model for the application. If someone gets unauthorized access to the server you are already screwed since your application can get access to the plain text data in order to serve requests.

If you threw away that computation for every request I would agree that it is wasteful but we don't need to do that if we rethink the role of the database in our application.


> I disagree with the dishonorable mention as you've only considered the naive case which you can build upon to avoid the issues mentioned. Consider a design where everything is AES-256 encrypted in the application and the database just receives an encrypted binary. You can build an in-memory cache for situations where you need to search sensitive information (first name, last name, ssn) by using a tokenizer and using a technology like Erlang term storage. It will be eventually consistent which should be fine for our use case since that type of information does not change that often and the cache recompute can be done asynchronously.

https://tonyarcieri.com/all-the-crypto-code-youve-ever-writt...

https://blog.filippo.io/the-ecb-penguin/

There are defensible positions for which ECB mode is acceptable. I wouldn't classify the scenario you described as one of them.

> If someone gets unauthorized access to the server you are already screwed since your application can get access to the plain text data in order to serve requests.

If someone gets unauthorized access to the webserver, yes, it's game over.

If someone gets unauthorized access to the database server, and it's a separate machine from the webserver, then your analysis here isn't applicable.

That's the threat model alluded to in the beginning of the article, but if it's not your threat model, then you may reach different conclusions about the best way to protect data. (You may not even use encryption, for example.)


If I understand this correctly, you basically store HMAC or password hash as a separate field and index on it? That may work for SSNs (which are unique), but it won't work for other things you commonly need to encrypt, such as personally identifiable data (in healthcare scenarios).

If you (for example) encrypt first names of people (or any other data point that is not unique per entry) using this scheme, then HMAC will reveal all the rows that have the same first name. You can then use frequency analysis to determine with high probablity what the encrypted names are.


> If you (for example) encrypt first names of people (or any other data point that is not unique per entry) using this scheme, then HMAC will reveal all the rows that have the same first name. You can then use frequency analysis to determine with high probablity what the encrypted names are.

This is where things get difficult to explain, because most health care programs are going to care about compliance first and foremost. So with that in mind:

1. You probably don't need to encrypt their first name to be e.g. HIPAA compliant, but...

2. Using a very short Bloom filter increases the odds of false positive collisions. Combine this with Argon2 and aggressive rate limiting, and now you've frustrated frequency analysis and chosen plaintext attacks greatly.

Given the threat model that we've given (database is not the same machine as the webserver, and the database server is what gets compromised), I can't see a better solution.


This would depend on how large the database was. If the names in the database were non-representative of the general population then frequency analysis is not going to help much.


How is this different than CryptDB?

https://css.csail.mit.edu/cryptdb/


From what /r/crypto says, they leak metadata to the database server.

https://www.reddit.com/r/crypto/comments/6e5jjz/building_sea...


I know this is a I'm going quest and for that reason it annoys me when i see my companies competitors say they encrypt all their users searchable data and I know what they are really doing is using Amazon disk level encryption that's not really helpful if your live database gets dumped.



I agree that the HMAC approach is better but a proper deterministic encryption would work just fine, wouldn't it? You're comparing your method with ECB mode which is a bit unfair.


What are some constructions you'd consider "fair"?


AEAD cipher with nonce generated from plaintext? That would have equivalent security properties except for being needlessly complicated?


So you mean something like this?

    $nonce = sodium_crypto_generichash($message, '', 24);
    $ciphertext = $nonce . sodium_crypto_aead_xchacha20poly1305_ietf_encrypt(
        $message,
        $nonce,
        $nonce,
        $key
    );
That solves the literal search problem, but none of the others.

However, in my experience[1], when it comes to deterministic encryption, most PHP developers will use ECB mode, or CBC mode with a NULL IV.

[1]: https://meta.stackoverflow.com/q/293930/2224584


Why not just use AES-SIV instead of separate encrypted and HMAC fields?


The practical reason: https://3v4l.org/LEar1 (No SIV in that list.) It's also not present in libsodium or (shudder) mcrypt.

What property are you hoping to extract out of AES-SIV in particular? (And would AEZ, HS1-SIV, AES-SIV-GCM, etc. solve your problem?)

From what I understand, SIV mode still uses a nonce, but it doesn't explode if you reuse it for different messages. Are you aiming for nonce-less encryption?

The end goal that was being worked towards in the article (Argon2-derived Bloom filters for partial searching) is a little more useful than nonce-less encryption.


I was specifically after a deterministic encryption scheme that would allow for exact-match comparisons and whose deployment was fairly foolproof (so a non-expert couldn't screw it up too badly), which AES-SIV appears to provide nicely. Another benefit in my case is that I'm using a NoSQL database where separating the encrypted value from its indexed value is not always practical, so having a single value that serves both purposes is highly desirable.

I'm not familiar with the other schemes you mentioned, but the practical reason for using AES-SIV is the converse of yours for not using it: there's an implementation for JavaScript. :) It's unlikely there are implementations for the other schemes you listed.

I agree that the Bloom filters are a nice natural outcome of your approach, though it would also be possible to combine AES-SIV for primary encryption with a truncated HMAC for partial match Bloom filters.


Here's the killer feature of my approach: Each filter has its own HMAC/Argon2 key, so it becomes less useful for what I like to call crossword puzzle attacks. (I hope that's easier to understand than a wordy explanation of how such an attack would work.)


yay PHP/SQL (stack I work with, well MySQL)




Applications are open for YC Summer 2019

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: