
Ask HN: De-identifying data without destroying quality - monkeydust
We are exploring how to aggregate a significant amount of data from different clients to use in a machine learning model.

The output of the model will be shared with all clients.

The clients require that the model only see data that has been stripped of sensitive information. We could crudely just remove the columns that we felt were sensitive, but this would hurt the performance of the model.

Has anyone got experience or thoughts on how to approach this?

Any software, open source or not, that could help?

Txs
MD
======
jklein11
Do you need the data to be anonymized or just de-identified?

I know this won't sit well with privacy-minded folks, but if you just need the
data de-identified and not anonymized, you could pick the fields that might
contain sensitive data and do a character-for-character swap. This way you
retain the information without storing the personal information in raw form.
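
A minimal sketch of what such a character-for-character swap could look like,
assuming the fields are plain ASCII strings. The names make_swap_table and
swap, and the seed value, are made up for illustration. Note that this is a
simple substitution: anyone holding the table can reverse it, and letter
frequencies survive, so it only de-identifies and does not anonymize.

```python
import random
import string


def make_swap_table(seed):
    """Build a fixed one-to-one character mapping from a secret seed.

    Letters map to letters of the same case and digits map to digits, so the
    shape of the value (length, punctuation, casing) is preserved.
    """
    rng = random.Random(seed)
    mapping = {}
    for alphabet in (string.ascii_lowercase, string.ascii_uppercase, string.digits):
        shuffled = list(alphabet)
        rng.shuffle(shuffled)
        mapping.update(zip(alphabet, shuffled))
    return str.maketrans(mapping)


def swap(value, table):
    """Apply the mapping to every character; unmapped characters pass through."""
    return value.translate(table)


def invert(table):
    """Build the reverse table so the holder of the seed can undo the swap."""
    return {ord(swapped): original for original, swapped in table.items()}


# Hypothetical seed; in practice it would be kept secret and out of the dataset.
table = make_swap_table("keep-this-secret")
masked = swap("Alice Smith, acct 555-1234", table)
restored = swap(masked, invert(table))
```

Because equal inputs always produce equal outputs, joins and group-bys on the
swapped field still work, which is the point of keeping the information.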

------
2rsf
What do you need to anonymize? What type of system are you asking about? Do
you need to comply with GDPR?

Names, addresses, ID numbers, or account numbers can easily be randomized.
Dates and numbers (what kind of system is it?) are trickier since they are
used in calculations. Finally, the tricky part is making sure that the
anonymized data still can't be traced back to real entities.
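
A rough sketch of the randomization idea, under some assumptions of mine:
identifiers are replaced by random tokens that stay consistent within one run,
and dates are shifted by a random per-entity offset so that intervals between
one entity's dates still come out right in calculations. The date shift is one
possible approach, not something prescribed above, and the Pseudonymizer class
and its method names are invented for the example. It does not by itself
guarantee that records can't be traced back to real entities.

```python
import secrets
from datetime import date, timedelta


class Pseudonymizer:
    """Replace identifiers with random tokens and shift dates per entity."""

    def __init__(self, max_shift_days=180):
        self._tokens = {}   # (field, value) -> random token
        self._shifts = {}   # entity id -> day offset
        self._max_shift = max_shift_days

    def token(self, field, value):
        """Return a random token, reusing it when the same value appears again."""
        key = (field, value)
        if key not in self._tokens:
            self._tokens[key] = f"{field}_{secrets.token_hex(8)}"
        return self._tokens[key]

    def shift_date(self, entity_id, d):
        """Shift a date by a random offset that is fixed for each entity."""
        if entity_id not in self._shifts:
            span = 2 * self._max_shift + 1
            self._shifts[entity_id] = secrets.randbelow(span) - self._max_shift
        return d + timedelta(days=self._shifts[entity_id])


p = Pseudonymizer()
account = p.token("account", "12345678")
opened = p.shift_date("12345678", date(2020, 1, 1))
closed = p.shift_date("12345678", date(2020, 3, 1))
# The interval between the two shifted dates equals the original interval.
```

The token and offset tables are the re-identification key, so they would need
to be discarded or stored separately from the shared data.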

