Edit: I'm asking because I don't know, not because I'm saying it's possible.
Edit2: Also, it appears there was a pattern to SSN issuance prior to 2011, so the problem space may be far short of a billion. No numbers start with 000, for example. There's also a listing of numbers never issued, and a table of "highest assigned". Since SSNs issued after 2011 belong to people who are still children, they can be safely skipped, as they aren't as interesting for fraud. And if the state of residence is cleartext, you can search that space first if you assume a large number of people live in their birth state.
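To put a rough number on that, here is a minimal Python sketch that counts structurally valid SSNs. It assumes only the widely documented exclusions (area 000, 666 and 900-999, group 00, serial 0000); the "never issued" list and the "highest assigned" table mentioned above are not modelled, so the real pre-2011 space would be smaller than this bound.

```python
# Rough upper bound on structurally valid SSNs (AAA-GG-SSSS).
# Assumption: only the well-known exclusions are applied here:
# area 000, 666 and 900-999; group 00; serial 0000.
valid_areas = [a for a in range(1, 900) if a != 666]  # 001-899 minus 666
valid_groups = range(1, 100)                          # 01-99
valid_serials = range(1, 10_000)                      # 0001-9999

search_space = len(valid_areas) * len(valid_groups) * len(valid_serials)
print(f"{search_space:,} candidate SSNs")
```

That comes to a little under 889 million candidates before any issuance tables or state-based ordering are applied.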
Even without the key, the system is still vulnerable to frequency analysis. For example, if it were used to store passwords, a simple sort and count would reveal accounts that share common passwords, which could then be brute forced using lists of the most common passwords found on the internet. The suggestion of using blind indexes for searching on partial values (e.g. the first letter of a name) would increase the vulnerability.
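A minimal Python sketch of that sort-and-count step. The key, the sample passwords, and the use of HMAC-SHA256 as the deterministic transform are all assumptions made for illustration; the point is only that equal plaintexts produce equal stored values.

```python
import hashlib
import hmac
import secrets
from collections import Counter

# Hypothetical server-side secret; the attacker never sees it.
key = secrets.token_bytes(32)

def deterministic_token(value: str) -> str:
    """Keyed, deterministic tag: equal inputs always give equal tags."""
    return hmac.new(key, value.encode(), hashlib.sha256).hexdigest()

# Made-up table contents for illustration.
stored_passwords = ["123456", "hunter2", "123456", "password", "123456", "password"]
leaked_column = [deterministic_token(p) for p in stored_passwords]

# The attacker only has leaked_column, yet sort-and-count still works.
ranking = Counter(leaked_column).most_common()
most_common_token, occurrences = ranking[0]
print(occurrences, "rows share the most common token")
```

The attacker does not learn the password from the count alone, but does learn which rows to try the most popular passwords against first.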
Even if the attacker doesn't have access to the key, if the application is fast enough and the search space is small enough (e.g. common passwords), he may be able to brute force it by passing in every possible value and seeing how it encrypts.
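Sketched in Python, with the application modelled as a hypothetical `encrypt_oracle` function that the attacker can call but not inspect (the function name, key, and sample data are all made up):

```python
import hashlib
import hmac
import secrets

_server_key = secrets.token_bytes(32)  # hypothetical; never leaves the server

def encrypt_oracle(value: str) -> str:
    """Stand-in for an application endpoint that deterministically
    transforms attacker-chosen input with the secret key."""
    return hmac.new(_server_key, value.encode(), hashlib.sha256).hexdigest()

def build_dictionary(candidates):
    """Query the oracle once per guess and map output -> guess."""
    return {encrypt_oracle(c): c for c in candidates}

# Made-up data: a stolen column and a small list of common passwords.
stolen_column = [encrypt_oracle("letmein"), encrypt_oracle("correct horse battery")]
common_passwords = ["123456", "password", "letmein", "qwerty"]

lookup = build_dictionary(common_passwords)
recovered = [lookup.get(token) for token in stolen_column]
print(recovered)
```

Only values that appear in the guess list are recovered, which is why the attack depends on a small search space and on the application not rate limiting the queries.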
Secure key storage is always a challenge, and the suggestion that the key be stored on the application server instead of the database has some practical problems. In many cases the key will end up being duplicated in code repositories or build scripts, which increases your vulnerability to internal attackers. A better system would store half the key on the database server and the other half on the application server, which are then XOR'd to get the encryption key. This can reduce the number of potential internal attackers who could recover the plaintext.
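One way to read the XOR suggestion, as a minimal Python sketch. Note the assumption: with XOR, each "half" is really a full-length share, and either share on its own is indistinguishable from random bytes. The function names are hypothetical.

```python
import secrets

KEY_BYTES = 32  # assumed key size (256 bits)

def split_key(key: bytes) -> tuple[bytes, bytes]:
    """Split a key into two XOR shares of the same length as the key."""
    share_a = secrets.token_bytes(len(key))               # uniformly random
    share_b = bytes(k ^ a for k, a in zip(key, share_a))  # key XOR share_a
    return share_a, share_b

def combine_shares(share_a: bytes, share_b: bytes) -> bytes:
    """Recover the key; both shares are required."""
    if len(share_a) != len(share_b):
        raise ValueError("shares must have equal length")
    return bytes(a ^ b for a, b in zip(share_a, share_b))

key = secrets.token_bytes(KEY_BYTES)
db_share, app_share = split_key(key)  # store one per server
assert combine_shares(db_share, app_share) == key
```

Someone with access to only the repository or only the database then holds a single share, which is not enough to decrypt anything.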
Also, when recovering from an attack, it may be difficult or impossible in many cases to know how many systems were compromised. Without the ability to prove that the data and the key weren't both compromised, an auditor may require that you operate on the assumption that the data was compromised, with the associated requirements for notification and penalties.
Banks long ago solved the problem of secure key storage with the use of hardware security modules (HSMs), where the key is stored and used on a secure, tamper-proof device. This turned the problem of secure key storage from one of digital security into one of physical security, which the banks were already good at. Where is my key? My key is right here.
These modules are now becoming more commonly used in the ecommerce space (AWS now provides them, for example), but they are still fairly difficult to work with. Some shops will use hardened/audited servers that run encryption/decryption micro-services instead.
I would hope, very strongly, that nobody would encrypt passwords, but instead, hash them: https://paragonie.com/blog/2016/02/how-safely-store-password...
You're not bruteforcing H(m) here. You're trying to bruteforce H(m, k) without knowing k. To make matters worse, k is a random string of 128+ bits, generated by a CSPRNG.
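A short Python sketch of the distinction, assuming HMAC-SHA256 as the keyed hash H(m, k) and a 256-bit key from the operating system's CSPRNG. The SSN is a made-up placeholder and `blind_index` is a hypothetical name.

```python
import hashlib
import hmac
import secrets

def blind_index(message: str, key: bytes) -> str:
    """H(m, k): a keyed hash, modelled here as HMAC-SHA256 (an assumption)."""
    return hmac.new(key, message.encode(), hashlib.sha256).hexdigest()

key = secrets.token_bytes(32)        # the real k, held by the application
wrong_key = secrets.token_bytes(32)  # an attacker's guess at k
ssn = "123-45-6789"                  # placeholder value

stored = blind_index(ssn, key)

# Without k, enumerating every SSN does not help: neither a plain hash
# nor a keyed hash under a guessed key reproduces the stored value.
unkeyed = hashlib.sha256(ssn.encode()).hexdigest()
guessed = blind_index(ssn, wrong_key)
print(stored == unkeyed, stored == guessed)
```

The attacker would have to guess the key as well as the message, and a random 128+ bit key is far outside brute-force range.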
Edit: Thanks, got it now. A database dump on its own gives nothing useful. The client side is vulnerable, but has to be since it's ultimately serving up plaintext SSNs anyway.
- The webserver and database server are on separate bare-metal hardware
- The database gets compromised, not the webserver
If you can keep the webserver secure, and on separate hardware from the database, you can protect against some attacks rather than no attacks. And if your database server is used by multiple verticals within a single company, this is an even more defensible design decision to make.
Protecting the client side would be a completely different approach. Outboard/upstream tokenization or something. If it's directly serving up the sensitive data, there's no real way to leverage encryption for protection there.
Okay, I've added a section that spells this out explicitly and unambiguously. https://paragonie.com/blog/2017/05/building-searchable-encry...
From a security standpoint you don't change the threat model for the application. If someone gets unauthorized access to the server you are already screwed since your application can get access to the plain text data in order to serve requests.
If you threw away that computation for every request I would agree that it is wasteful but we don't need to do that if we rethink the role of the database in our application.
There are defensible positions for which ECB mode is acceptable. I wouldn't classify the scenario you described as one of them.
> If someone gets unauthorized access to the server you are already screwed since your application can get access to the plain text data in order to serve requests.
If someone gets unauthorized access to the webserver, yes, it's game over.
If someone gets unauthorized access to the database server, and it's a separate machine from the webserver, then your analysis here isn't applicable.
That's the threat model alluded to in the beginning of the article, but if it's not your threat model, then you may reach different conclusions about the best way to protect data. (You may not even use encryption, for example.)
If you (for example) encrypt first names of people (or any other data point that is not unique per entry) using this scheme, then the HMAC will reveal all the rows that have the same first name. You can then use frequency analysis to determine with high probability what the encrypted names are.
This is where things get difficult to explain, because most health care programs are going to care about compliance first and foremost. So with that in mind:
1. You probably don't need to encrypt their first name to be e.g. HIPAA compliant, but...
2. Using a very short Bloom filter increases the odds of false positive collisions. Combine this with Argon2 and aggressive rate limiting, and now you've frustrated frequency analysis and chosen plaintext attacks greatly.
Given the threat model that we've given (database is not the same machine as the webserver, and the database server is what gets compromised), I can't see a better solution.
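To illustrate the "very short" part of point 2, here is a rough Python sketch. It is not the article's construction: it swaps Argon2 for plain HMAC-SHA256 (Argon2 is not in the Python standard library), uses a single truncated keyed hash instead of a full Bloom filter, and the 8-bit width and generated names are arbitrary choices.

```python
import hashlib
import hmac
import secrets
from collections import defaultdict

INDEX_BITS = 8  # assumed width; shorter means more false positives

def short_index(value: str, key: bytes, bits: int = INDEX_BITS) -> int:
    """Keep only the top `bits` bits of a keyed hash of the value."""
    digest = hmac.new(key, value.lower().encode(), hashlib.sha256).digest()
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits)

key = secrets.token_bytes(32)
names = [f"name{i}" for i in range(1000)]  # made-up distinct values

buckets = defaultdict(list)
for name in names:
    buckets[short_index(name, key)].append(name)

largest = max(len(members) for members in buckets.values())
print(len(buckets), "buckets in use; largest holds", largest, "distinct names")
```

With 1000 distinct names and at most 256 index values, many unrelated names necessarily share an index, so a matching index no longer pins down a single plaintext. The application filters out the false positives after decrypting the candidate rows.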
$nonce = sodium_crypto_generichash($message, '', 24);
$ciphertext = $nonce . sodium_crypto_aead_xchacha20poly1305_ietf_encrypt(
However, in my experience, when it comes to deterministic encryption, most PHP developers will use ECB mode, or CBC mode with a NULL IV.
What property are you hoping to extract out of AES-SIV in particular? (And would AEZ, HS1-SIV, AES-GCM-SIV, etc. solve your problem?)
From what I understand, SIV mode still uses a nonce, but it doesn't explode if you reuse it for different messages. Are you aiming for nonce-less encryption?
The end goal that was being worked towards in the article (Argon2-derived Bloom filters for partial searching) is a little more useful than nonce-less encryption.
I agree that the Bloom filters are a nice natural outcome of your approach, though it would also be possible to combine AES-SIV for primary encryption with a truncated HMAC for partial match Bloom filters.