Isn’t it pretty standard practice to confine (overt, obvious) PII to the users t...

ryan_j_naughton · on Jan 26, 2018

Not if you have credit reports and other data that could contain PII as well. You don't want the name or address in the credit report to just point to your user's table, with an assumption that that was the name in the report. And if you don't know with certainty that the credit report pulled was for that person (there are collisions, mismatches, etc.), then you want to store that data as associated with the report and not the user (the report, in turn, is associated with the user).

For proper data integrity and data provenance, you want to know what you knew at each point in time. Thus, simply pointing to a user_id and hoping the data on the user's table was the data at some point in time in the past will result in leakage for data science (https://www.kaggle.com/wiki/Leakage): any later edit to the user's row silently changes what every earlier record appears to have said.
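
A minimal sketch of that point, under assumptions: the `User` and `CreditReport` types, their field names, and the sample values are hypothetical and not taken from any real schema. The report keeps its own copy of the name and address as they appeared when it was pulled, plus a separate, possibly empty, link to the user it was matched to.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class User:
    # Mutable: the user can change their details at any time.
    user_id: int
    name: str
    address: str


@dataclass(frozen=True)
class CreditReport:
    # Immutable snapshot of what the report said when it was pulled.
    report_id: int
    pulled_at: datetime
    subject_name: str
    subject_address: str
    # The match to a user is a separate claim and may be absent or wrong.
    matched_user_id: Optional[int]


users = {1: User(1, "Pat Doe", "1 Old Street")}

report = CreditReport(
    report_id=100,
    pulled_at=datetime(2018, 1, 26, tzinfo=timezone.utc),
    subject_name="Pat Doe",
    subject_address="1 Old Street",
    matched_user_id=1,
)

# The user moves after the report was pulled.
users[1].address = "2 New Avenue"

# The report still says what it said at pull time.
print(report.subject_address)
# Following the user_id instead yields today's value, not the historical one.
print(users[report.matched_user_id].address)
```

Training a model on the second lookup would feed it information that did not exist when the report was pulled, which is the leakage described above.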

closeparen · on Jan 26, 2018

>You don't want the name or address in the credit report to just point to your user's table, with an assumption that that was the name in the report

Well ok. In your example, call it a "credit report subjects" table.

>Thus, simply pointing to a user_id and hoping the data on the user's table was the data at some point in time in the past

References can be versioned or even hashed, as in Git. You would run the same risk with tokens, no?
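
A rough sketch of what a hashed reference could look like, assuming an in-memory dict as the store and SHA-256 over canonical JSON as the hash (both are illustrative choices, and `put`, `get`, and the record fields are hypothetical names). Each version of a record gets its own key, so a report that holds the key keeps resolving to the version it was created against.

```python
import hashlib
import json

# Hypothetical content-addressed store: digest -> serialized record.
store = {}


def put(record: dict) -> str:
    """Store a record under the hash of its canonical JSON form."""
    blob = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    digest = hashlib.sha256(blob).hexdigest()
    store[digest] = blob
    return digest


def get(digest: str) -> dict:
    """Resolve a hashed reference back to the exact record it names."""
    return json.loads(store[digest])


# Version 1 of the subject record, referenced by the report.
v1 = put({"user_id": 1, "name": "Pat Doe", "address": "1 Old Street"})
report = {"report_id": 100, "subject_ref": v1}

# An update creates a new version with a new key; v1 is left untouched.
v2 = put({"user_id": 1, "name": "Pat Doe", "address": "2 New Avenue"})

print(v1 != v2)
print(get(report["subject_ref"])["address"])
```

Because the key is derived from the content, storing the same record twice yields the same reference, and an old reference can never silently start pointing at new data.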