
Among the telemetry data:

> MacAddressHash - Used to identify a user of VS Code. This is hashed once on the client side and then hashed again on the pipeline side to make it impossible to identify a given user. On VS Code for the Web, a UUID is generated for this case.

A hash of a hash is about as expensive to brute-force as a single hash, and it still uniquely identifies a machine, tying telemetry events to a specific user's machine. Microsoft's own telemetry description generator calls the field "EndUserPseudonymizedInformation". Pseudonymisation is inherently not anonymisation.
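To see why the second hash buys nothing, here's a minimal sketch (the MAC address is hypothetical):

    import hashlib

    def h(b: bytes) -> bytes:
        # One round of SHA-256.
        return hashlib.sha256(b).digest()

    mac = b"00:1a:2b:3c:4d:5e"  # hypothetical MAC address
    once = h(mac)
    twice = h(once)
    # Both values are deterministic functions of the MAC alone, so the
    # double hash is just as stable (and unique) an identifier as the
    # single hash, and reversing it only costs an attacker ~2x the hashing.
    assert h(h(mac)) == twice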

This bullshit is why I keep my PiHole on for my dev environment.




Unless there is PII associated with the pseudonym, there is nothing in the GDPR that specifically says you can't or shouldn't do this, so long as it's not information that can identify a physical person. Note that being able to attribute multiple pieces of data to the same anonymous person does not necessarily identify them (and it's important not to accidentally do so).

It's important, though, if you e.g. have multiple products, to use a _different_ pseudonymization (hash salt or whatever) for each; otherwise you run the risk of linking too much data to a single user, thereby de-pseudonymizing them in the worst case even though no individual app does. Having a user's behaviour across multiple applications could pose such a risk in extreme cases.
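A minimal sketch of that separation (the product names and salts are made up for illustration):

    import hashlib
    import os

    # Hypothetical per-product salts; each product generates its own.
    salt_vscode = os.urandom(32)
    salt_office = os.urandom(32)
    mac = b"00:1a:2b:3c:4d:5e"  # hypothetical MAC address

    id_vscode = hashlib.sha256(salt_vscode + mac).hexdigest()
    id_office = hashlib.sha256(salt_office + mac).hexdigest()
    assert id_vscode != id_office  # same user, unlinkable pseudonyms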

Edit: I think it's important to separate "hashing" from "hashing". A properly hashed identifier uses a salt that is generated on the client, so that the hash can't be used to identify the user. Basically: the first time the app runs, you generate a random salt which is stored only on the client and NEVER sent in telemetry. Anything you would like to transmit over the wire that risks identifying the user (e.g. a computer name or MAC address) you hash with this local salt. This way no one can go to the database on the server side and try to match any data, e.g. check whether the hash abc123 matches the computer name jimbob because hash("jimbob") = abc123. Just sending hash(MacAddress) without a local random salt would NOT be properly pseudonymous, because an attacker on the server side could ask and answer the question "Does this event come from this MAC address?".
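A minimal sketch of that scheme in Python (the file name and function names are hypothetical):

    import hashlib
    import os

    SALT_FILE = "telemetry_salt.bin"  # hypothetical client-local path

    def local_salt() -> bytes:
        # Generated once on first run; stored only on this machine
        # and never transmitted.
        try:
            with open(SALT_FILE, "rb") as f:
                return f.read()
        except FileNotFoundError:
            salt = os.urandom(32)
            with open(SALT_FILE, "wb") as f:
                f.write(salt)
            return salt

    def pseudonymize(value: str) -> str:
        # hash(localSalt + PII): stable on this client, but the server
        # can't confirm a guess like sha256("jimbob") without the salt.
        return hashlib.sha256(local_salt() + value.encode()).hexdigest()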


The hash used, at least when I looked into it last, was a plain SHA-256 hash, no salt or pepper. That's a unique identifier.

I think the massive amount of behaviour analysis Microsoft does should be considered PII. They know when you turn on Visual Studio in the morning, and when you leave. They know when you go to lunch and don't click any buttons for a while, and they can see that the colleagues with you in that boring meeting are also not clicking any buttons at the same time. This type of behaviour analysis over time can associate you with the people you interact with, even if it's not directly tied to a reversible hardware ID.

This is why pseudonymisation isn't anonymisation, and why pseudonymisation isn't sufficient to comply with laws like the GDPR.

If the behaviour analysis were done without identifiers at all, you could say they're just counting button clicks, but they intentionally associate this data with your stable personal identifier for analysis over time.

MAC addresses aren't that big of a search space either; any consumer GPU can enumerate and hash every MAC address actually in use in a reasonable amount of time. MAC addresses may theoretically be 2^48 in size, but most of that space hasn't been assigned to vendors yet: the effective space is the number of assigned OUI vendor prefixes times the 2^24 device addresses under each. It takes about 12 minutes to reverse any given MAC address when you rent a single cloud GPU, and the double hashing should take about twice that time.
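The back-of-envelope arithmetic, as a rough sketch (the OUI count and GPU hash rate are assumptions, not measurements):

    # Back-of-envelope only; both constants below are rough assumptions.
    ASSIGNED_OUIS = 35_000   # approximate number of IEEE-assigned vendor prefixes
    PER_OUI = 2 ** 24        # 24 device-address bits under each prefix
    HASH_RATE = 1e9          # assumed single-GPU SHA-256 rate, hashes/second

    keyspace = ASSIGNED_OUIS * PER_OUI
    print(f"effective keyspace: {keyspace:.2e}")   # ~5.9e11, vs ~2.8e14 for 2^48
    print(f"exhaustive search: {keyspace / HASH_RATE / 60:.0f} min")  # ~10 minutes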

The weird thing is that Microsoft intentionally chose to use a MAC address rather than a UUID like they use on their web version. If this was just a unique user token, they wouldn't need to use any hardware identifiers at all.


You are right in the edit. The hash needs to use a secret salt that is unavailable to any potential attacker in order not to be PII.

You're mixing up the terms pseudonymization and anonymization, though. If something is provably not PII, it is considered anonymous. Pseudonymization specifically means keeping the data as PII, but reducing the risk of misuse by making identification hard.

In practical terms, pseudonymous data is data that someone like a data scientist will only be able to link to a person by making a deliberate effort to do so, which will almost certainly mean that she KNOWS she is breaking some law. It may also mean that the link between the person and the pseudonym is stored in a locked-down database that most data scientists (or others who may have an interest in doing the linking) do not even have access to.

The GDPR does promote the use of pseudonymization as a layer of protection, and if a business does keep some PII around, properly categorizes that data as such (in compliance with Article 30 of the GDPR, with a defined "Legal Ground" for each processing activity), AND properly protects the data through both "Security by Design" and "Privacy by Design" (of which pseudonymization is an important element), its legal exposure can be either completely negated or at least radically reduced if the "Legal Ground" is challenged.

Overall, though, fully understanding the GDPR is terribly difficult, as it requires significant understanding of Law (international AND local within each country covered by the GDPR), Computer Science (development AND IT security), AND a good understanding of Data Science.

I rarely meet people with enough understanding of all 3 to assess practices that are in the gray zone.

Lawyers (and most DPOs) tend to have little understanding of the IT or Data Science aspects, but tend to be good at stretching a "Legal Ground" to whatever the business needs in order to stay profitable.

Data Scientists tend to know how to de-pseudonymize data, and may even be taught "Privacy by Design" (this usually has to be forced on them, though, as it makes their job harder). Most data scientists struggle with IT security aspects, though, and would in many cases happily download all data to their laptops if they could.

Developers/engineers may understand concepts such as hashing, and even know the difference between hashed and encrypted data. However, as they live in a boolean world of True vs False, using judgement to evaluate the risk impact of some practice on data subjects tends to be alien to them. In a black and white world, this group tends to think that every bad practice is equally bad, instead of going for the "lesser wrong" or "good enough". Especially if the measures needed to be "good enough" make the coding harder or the system slower.

Finally, IT security (the experts, not the drones) MAY have a better understanding of degrees of risk than developers, but tend to know/care less about the actual data than any other group.

And each group tends to hold the other groups to a higher standard than their own. The lawyers tend to assume that all aspects of development and infrastructure are properly hardened. Data scientists tend to interpret the "Legal Ground" as covering whatever they want to use the data for. Developers tend to think that the infra that runs their systems is fully secured by perimeter protection, and may even store "secrets" in more or less open git repos (and even if they delete them later, they don't clean up the git history or rotate the secrets). And networking people often don't care about anything at the application layer or higher of the networking stack.

So in practice, any large corporation will have a huge number of vulnerabilities. The only way any sensitive asset (from a privacy, intellectual property or operational stability perspective) can be considered properly protected is to have multiple layers of protection, all or most of which must fail for major incidents to happen.


I use pseudonymization in the sense of having persistent identifiers for users/machines/etc that cannot be reversed on the server side.

Basically: just like the usernames on HN are pseudonyms, it's important that they are persistent so you can follow who wrote what, despite not being able to attribute posts to physical persons. That is: HN is a pseudonymous forum rather than an anonymous one.

The hash(localSalt + PII) is provably not PII, but it still makes the data possible to correlate: the telemetry event I send on Monday can be attributed to the same source as the event I send on Tuesday.
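A minimal self-contained sketch of that property (the MAC address is hypothetical; the salt stands in for the client-local salt from the sketch upthread):

    import hashlib
    import os

    salt = os.urandom(32)  # client-local, never transmitted

    def pseudonymize(value: str) -> str:
        return hashlib.sha256(salt + value.encode()).hexdigest()

    # Same input, same pseudonym: Monday's and Tuesday's events correlate
    # by design, yet the server can't test candidate MAC addresses
    # without the salt.
    monday = pseudonymize("00:1a:2b:3c:4d:5e")
    tuesday = pseudonymize("00:1a:2b:3c:4d:5e")
    assert monday == tuesday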



