> In this paper, we propose an efficient algorithm for logistic regression on encrypted data, and evaluate our algorithm on real financial data consisting of 422,108 samples over 200 features. Our experiment shows that an encrypted model with a sufficient Kolmogorov Smirnow statistic value can be obtained in ∼
∼
17 hours in a single machine.
I need an eli5 for this, attempt at reading makes it sound like i would be able to infer an approximation of encrypted data in the examples that they provide.
Basically, if this works as the headline suggests, and my poor understanding of the paper implies we can take their example of competing financial institutions:
Two banks share data, that is intended to be secret, and they use this technique to compute some property over the data. Each bank can then do the same computation over there data alone, and compare the resulting model over their data and the result over their competitor’s. From that they can infer the properties of their competitors portfolio, which seems to be leakage of data that is not intended to be possible.
Hence I am clearly missing something, which isn’t unexpected as HE is unintuitive to me.
The point of the paper was to operate over data without decrypting, the exact scenario they gave was competing finance companies, that are clearly not going to share keys.
Also if you had all of the decryption keys you’d just decrypt the data and use the raw data. They explicitly state that it is only fast in the context of HE problems - being multiple orders of magnitude slower than techniques you can use on raw data (they actually said “fast” - complete with the quotes, which I appreciated)
I agree the example is weird, but that's literally what the paper says:
"In the training phase, it takes as input an encrypted training data and outputs an encrypted model without using the decryption key. In the prediction phase, it uses the encrypted model to predict results on new encrypted data." Figure 1 further implies that the results must be decrypted.
This is the typical operational setting of homomorphic ML.
I think the aim is to be able to store encrypted data in the cloud or some other potentially insecure location, and perform machine learning on it without decrypting the data or the model.
My understanding of this paper/space is patchy-at-best, but I am I right in thinking that this could only ever work for models where both the outcome-variable and the predictors are entirely categorical, and the input data is always encrypted with the same key?
(E.g. so that the value 'male' in the 'sex' column has the same ciphertext across subjects/observations).
If this is the case, then couldn't a standard logit model be used? (It needn't know what it's categories 'mean', either x or y).
These properties would be fulfilled by, e.g. the MNIST example in the paper (and it wouldn't take 17 hours to train!). Does anyone know where the technique in the paper would work but the one described above would fail?
The authors should do a comparison with recent work on HE for inference on other model types. Logistic regression is a subset of neural network inference, and there is significant prior work that appears to be competitive with their results.
Recent work [0] shows that MNIST, for example, can be done in a model secure/data secure way with online 30ms inference latency. For ~500,000 samples, evaluated sequentially, that's roughly 5 hours, which beats their time by about 3x. Perhaps I'm missing details in my read of the paper.
Gazelle paper uses an interactive protocol ("combination of homomorphic encryption and traditional two-party computation techniques") compared to thread paper.
Oh? Is that how they do it? I recall getting confused about the security guarantees of some HE paper the was on HN a while ago, and someone finally explained that you only get the security guarantees if you have randomness injected.