Now, the above guidance is related to PECR rather than GDPR, which is what your post is about. But, given the above, do you think that your software is compliant/exempt from PECR or do you think that organizations will still have to take extra steps to be compliant with privacy legislation?
We don't consider ourselves to be building any sort of 'server side cookie', especially since an anonymous hash is only ever tied to one piece of data and is set to null as soon as another request comes in. Unlike a cookie, which sticks with you as you browse a website, this data doesn't follow the user around.
We've spoken with a few lawyers about this and there's too much grey area at the moment. Time will tell, and we're hoping that the UK (my home country) sorts PECR out.
AFAICT, v1 of PECR awkwardly applies whenever the cookie is not directly necessary for the service that the user is using. PECR applies even if, like here, the cookie is just for counting unique visitors and is not used for fingerprinting individuals.
The draft v2 of PECR contains an exemption for first-party analytics. I think this may strike a nice balance: explicit consent would still be required for the more harmful third-party analytics.
Not sure when v2 of PECR will happen. It is years overdue. Perhaps it is a priority for the newly elected European Parliament and the new Commission?
FWIW, my guess would be that the definition is fairly strict.
Now, I don't think that would prevent the first party from using data processors, but I suspect the first party would have to exercise a lot of control over the processor. That's in contrast to a service like Google Analytics, where the first party's control and choice are limited to take it or leave it.
If I want to integrate this with a single-page app (like Ember or React), are there enough API hooks to track click events, page load events, etc. in the JS? We threw together Google Analytics for our launch just so we would have SOME data, but we want to move away from it ASAP for privacy reasons.
Any more on how it would compare to something like Piwik (the product we're looking at)?
"For tracking unique page views"
navigator.sendBeacon("/unique-pagehit?" + encodeURIComponent(location.href));
I'm not sure I fully understand what is being measured (is it session-only?). For the duration someone watched a page, you can use sendBeacon in onBeforeUnload. To detect a bounce, set a Math.random() value in a session variable, send it at the start of the page, and have every subsequent page load also send the previously stored value. Then count the random keys the server only ever received once - those are the bounces.
I know, in practice you'll need to trim sessionStorage, sanitize URLs, use something less collision-prone than Math.random, deal with new tabs, add some polyfills and other robustness, etc... but I don't yet see why the tracking mentioned needs any user ids or hashing at all.
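A minimal sketch of that flow (function and endpoint names are mine, and the random id is a toy stand-in; in a real page you'd pass `window.sessionStorage` and `navigator.sendBeacon`):

```javascript
// Sketch of cookie-free bounce detection as described above.
// A per-tab random id lives in sessionStorage; every page load reports
// the previously stored id (if any) plus a fresh one. Any id the server
// never sees re-sent as "prev" was a single-page visit, i.e. a bounce.
function recordPageview(storage, send) {
  const previous = storage.getItem("visitId"); // null on the first page of a session
  const current = Math.random().toString(36).slice(2); // toy id; use crypto.getRandomValues in practice
  storage.setItem("visitId", current);
  send("/pagehit?prev=" + encodeURIComponent(previous || "") +
       "&cur=" + encodeURIComponent(current));
  return { previous, current };
}

// In a browser you would call:
// recordPageview(window.sessionStorage, url => navigator.sendBeacon(url));
```

Server-side, a key that only ever arrives as `cur` and never as `prev` is a bounce; no user id or hash is involved.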
The community version is getting a full update soon. We just have to focus on profit a bit (this keeps us in business and able to update the repo).
I cannot give you any tips on managing your time between the two, but you may want to consider raising your prices. Back in 2011, we ran our basic account at $50/mo, Business at $150/mo and Enterprise at $CallMe/mo. We could probably have upped that after a year with all the new features we'd put in, but we were acquired before the 2y mark and dropped the first 2 plans.
We still maintain the open source version of the code, but neither of the original founders work on it (for me it's gone back to being a hobby because there are now people paid full time to work on it). We still get questions about the lag in updates. We typically do quarterly bulk pushes to the open source version now.
When I heard that the new version wasn't going to be open source it was disappointing, both for the simple fact that I value open source, and for how it wasn't really announced so much as heavily implied until I actually asked. Knowing that you've listened to the community and are going to open source the new version as well is a huge relief for me, and will definitely result in me promoting your product again.
I would say from a communication standpoint things could have been clearer, and I'm sure you'll work on that. From a timeline standpoint, it would be nice if the open source and hosted versions were eventually released together, but with a full rewrite I also understand that you probably want to clean it up and make sure it's in good shape first. That said, if you want some help alpha testing the open source version, I'd be happy to assist, and I'm sure others would be as well.
However, since such businesses already need to collect personal info as part of account creation, it shouldn't be hard to build analytics on top of that existing PII. If they're already collecting PII, it doesn't seem to save much for their analytics tool to avoid it?
What the article is discussing looks (at first brush) to be a sensible way of aggregating users up-front before it hits the database, rather than later. So no personal data is stored.
Does this meet the requirements for a site to avoid notifying users under the GDPR? I have no idea.
Even with the best of intentions, if you use a service like this then you are relying on them a) doing what they claim, and b) not screwing up (by leaving logs around, etc).
If I use this service and data from my users gets leaked by Fathom, who gets blamed? The users were on my site, so I guess it is I that gets fined. Maybe the risk is worth it, maybe it isn't.
"64. The ICO cannot even take action directly against a processor who is entirely responsible for a data breach, for example by failing to deliver the security standards the controller has required it to put into place. However, in these cases the ICO may decide not to take any enforcement action against the controller if it believes it has done all it can to protect the personal data it is responsible for and to ensure the reliability of its processor, for example through a written contract. However, whilst the ICO cannot take action against the processor, the data controller could take its own civil action against its data processor, for example for breach of contract."
Though it goes on to say that in some circumstances, the processor can _become_ a controller, in which case the ICO can go after it.
If you ran a SaaS and used Fathom Analytics, an SMTP email provider (to send password resets), a newsletter provider (for your monthly newsletter), a blog host, etc., each of those would be data processors, as you (the SaaS) are making the determination of where user data goes on the backend.
As a data controller, it's your responsibility to make sure that each of those services you are using is handling the data in an appropriate and safe way.
As to who gets fined if there is a data breach: the answer is likely nobody. I say this because a breach doesn't automatically mean a fine. What really matters is what actions you take before and after a breach.
- As a data controller, did you notify your users without undue delay?
- As a data processor, did you notify the SaaS without undue delay?
- Did you identify the source of the breach? Did you take steps to remedy it?
For almost all of these privacy regulations, if you:
1. Take steps to protect user data/privacy like https, encryption etc.
2. Provide a mechanism to allow users to make data requests for their own identifying data
3. Notify users if there's a data breach
You would be in compliance.
Not necessarily. It's a bit complex, but the fact that Fathom ingests personal data at all likely means that they must still be disclosed by whoever is using Fathom's code.
On the other hand, if Fathom were able to push data ingestion into the first party's infrastructure so that only aggregated data hits Fathom's own infrastructure, I'm pretty sure that would put Fathom in the clear, because they would no longer be processing or storing personal data.
To Fathom's credit, I am a big fan of the steps they seem to be taking with anonymization regardless, and I will consider using them instead of GA for future projects.
I went and reread parts of GDPR, and I think you are right, though not because of Article 26. I think it's pretty clear that after your ingestion the data has been in good faith anonymized to the degree that it is no longer personal, and therefore your analytics code should be exempt from consent rules.
The interesting question to me is whether a controller deciding to put your pixel on a page for analytics purposes counts as you processing on the controller's behalf to an extent requiring consent. I don't see any clause specifically regarding third party access to personal data (as opposed to third parties processing personal data), so I agree with your stance that it's most likely fine.
That said, the articles of a regulation should be read in light of the recitals (as the latter are the rationale behind the actual binding legislation).
I would have more faith in the privacy if you didn't store the salt in the DB or permanent storage. If you manage to statically load-balance the users (e.g. hash site, IP, user-agent - don't forget the site), the hash could be in-memory only. Sessions would break on server restart, but that's more of a feature.
To take things further, you might not even need to store the hashes in the DB. Keep them in server memory only and update aggregate data in the DB in real time.
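A rough sketch of what that could look like (all names are hypothetical, and the "DB" here is just an in-memory object standing in for whatever aggregate store you'd actually write to):

```javascript
// Sketch: visitor hashes live only in process memory; only aggregate
// counters would ever be persisted. Nothing per-visitor touches disk.
const activeVisitors = new Map();                // hash -> last-seen timestamp (memory only)
const aggregates = { pageviews: 0, uniques: 0 }; // this is all that would hit the DB

function recordHit(visitorHash, now = Date.now()) {
  aggregates.pageviews += 1;
  if (!activeVisitors.has(visitorHash)) aggregates.uniques += 1;
  activeVisitors.set(visitorHash, now);
}

// Expire idle visitors so memory stays bounded (e.g. 30-minute sessions).
function expire(olderThanMs, now = Date.now()) {
  for (const [hash, seen] of activeVisitors)
    if (now - seen > olderThanMs) activeVisitors.delete(hash);
}
```

The trade-off is as noted above: a server restart loses the in-flight session state, though the persisted aggregates survive.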
It's an interesting idea. We have multiple servers behind the load balancers, so we'd have to store them in Redis, but that is no better than permanent storage, since Redis could still be breached and the data read with ease.
I wondered whether you could explain what makes your hashing different from the hashing used by Facebook for their custom audiences tool which was deemed unsuitable for anonymisation as per https://www.spiritlegal.com/en/news/details/e-commerce-retai...
I use fathom on one of my sites and I like it. I was just replying to someone who said people could just use access logs instead. Some don't like adding JS to their site I guess.
Ad blockers will also block requests to the tracker.js if you use analytics.domain for example.
(I mean, I don't have a point here but I find it pretty interesting. xD)
Is that typical for privacy-focused tech sites? I would have expected a lower percentage.
(DDG fan here, but people look at me funny when they see me using it...)
However, this is absolutely a cookie. Scraping just enough information from the browser to create a unique but stable hash and then having the browser compute it every request isn't at all different from that browser information acting as the cookie.
Nevertheless, the chances of identifying someone are probably pretty low, and it's a good effort to make analytics more privacy-friendly.
(Assuming IPv4. Obviously IPv6 addresses would be much harder to brute force, but still easier than a 256-bit hash.)
I mean, hey, here's a hash: 26246226167b9f190d3a1ce726efe07ae18bbf0480a78d19390b9aaf13f25cb0
Imagine you just got hold of it through a data breach. I'll give you $1,000 if you can get it de-hashed before midnight Chicago time ;)
So one of the reasons we posted to HN was for conversations like this. Reading what you put makes me think we need to do more when creating the PageRequestSignature and the SiteRequestSignature, because if someone had access to our server, got the hash, stopped all cron jobs from processing data, then they could work it out after a huge amount of time / computing power. But to be honest, at this point, they could also add log($_SERVER) and get the entire request body of a user, so we would have much bigger problems in that scenario.
Anyway, you've given me a few new thoughts on hardening against data breaches. Because, yes, if they know the hash & have full access to our server then it becomes easier. So we almost need to get to a point where they won't know the salt that was used for a particular user. So instead of recycling a single salt, we'd recycle multiple salts based on, perhaps, the first 2-3 digits of a user's IP address combined with the last 2-3 digits.... Then it would be much harder to break without first knowing the $_SERVER dump (the user's IP etc.). Obviously this wouldn't stop a "complete control of server" attack where they could just start logging everything, but it would really ruin a brute forcer's day because they wouldn't know which salt to start with.
What do you think of that idea? I'm running on little sleep so be nice ;) Also thanks so much for all your feedback so far, it's so appreciated!
Of course, if an attacker has access to the salt and is interested in the pages viewed by a specific known IP address then this all becomes much simpler. Then the only unknown is the user agent, which is relatively low-entropy.
> So we'd not recycle a single salt, we'd recycle multiple salts that are based on, perhaps, the first 2-3 digits of a users IP address combined with the last 2-3 digits....
If an attacker is already brute forcing the IP address and has the ability to obtain the salt then they would just use the correct salt for each IP address. Making the salt depend on other data already included in the hash doesn't change the size of the search space.
If they had the salt, IP and user agent, they’d have to also brute force every possible hostname and pathname, which would be insane. But I suppose they’d only have to do a few million based on the data we have on hostnames / pathnames...
Lots of ideas for improvement are popping into my head and I love how this community keeps challenging you to improve things. We had feedback on Reddit but it was much angrier!
The next step is to take your feedback and look into how we would defend against the scenario you’ve provided. Thank you!
> Brute forcing a 256 bit hash would cost 10^44 times the Gross World Product (GWP). [...]
> We have rendered the data anonymous to the point where we could not identify a natural person from the hash.
> It's possible that GDPR does not apply to Fathom since data is made completely anonymous. Even if GDPR did still apply, we reiterate the stance that there is legitimate business interest to understand how your website is performing.
This seems to imply a profound confusion between hashing and anonymity. Just because it's hashed doesn't mean it's anonymous! You don't need to "brute-force" the hash, you just need to find a user that matches your hash... which is 1 in 7 billion (or so), much more tractable. This is also the principle that e.g. MD5 rainbow tables are based on...
They claim to change the hash every 24 hours, so it's equivalent to having a session cookie with 24-hour expiration (session cookies are "anonymous" by their definition, they don't have any user information and they're impossible to "brute force", they "just" enable tracking). I've no idea if 24-hour session cookies are GDPR-compliant...
Regarding (2), given that this seems (again, I might be misunderstanding) equivalent to a 24-hour session cookie, why not just do that? However, then you're ... drumroll ... giving control to the user. Why not just give control to the user, period?! For example, by storing the list of pages visited in Local Storage, and only pinging the server once for each page(view) every 24 hours?
> You don't need to "brute-force" the hash, you just need to find a user that matches your hash... which is 1 in 7 billion (or so), much more tractable. This is also the principle e.g. MD5 rainbow tables are based on...
Not quite. We use a SHA256 hash as our salt, and that changes each day, so you'd need to brute force that.
In terms of how many possible combinations there are for this salt, please see: https://stackoverflow.com/a/49520766 - you would need to brute force it and try each possible combination with every single possible IP / User Agent / Site combination to break a hash. This is why it's not theoretically impossible but it's practically impossible.
We would love to approach things in an easier way but PECR doesn't want cookies, even anonymous ones.
Now, one thing that we have uncovered thanks to someone on here is that we need to increase our resistance to data breaches. If someone had complete, unlimited access to all our data / servers, including the daily salt, then they could de-hash page views from the last 30 minutes. I have no idea how long that would take. There are 4,294,967,296 (?) possible IP addresses, and over 3M (?) user agents, so it'd be an absurd, pointless exercise.... Anyway, we're going to be bringing in multiple salts that depend on the user's IP address, meaning that, in the event of a data breach, a hacker won't know which salt was used for a hash :) Perhaps we base the salts on the first 3 digits of an IP address? That would mean we'd have 720 possible SHA256 salts!
Also, for non-GDPR IP blocks, maybe just store a per-client salt in a cookie(!) and then XOR it with the rotating server salt.
What's the difference between a page view and a visit?
Thanks for building this, I will promote it.
I've been looking for a privacy-first tracker for some personal sites for a while, but nobody is offering pricing that makes sense at the lower end.
I'm not sure this is a good idea.
Analytics is required for business and isn't going anywhere. The laws don't feel like they are trying to shut down analytics completely; they are just asking this type of software to do better. That's what I think we are doing with this, and there are no other analytics companies who come close to our level of obfuscation and non-tracking of personal data.
If the intent of the law is to do better with privacy and data, we are doing that to the best of our abilities. It's not skirting around the issue; we are agreeing with it in our code and in the logic of how our tracker works.
Alice visits a site and gets the hash 1234. The analytics data is stored and associated with hash 1234, but soon after, hash 1234 is removed. However, the aggregate visitor analytics that were associated with hash 1234 persist. Then another user (say Alice again) returns and gets hash 5678. Analytics data is tracked, stored with hash 5678 for 30 minutes (or less), and then hash 5678 is again removed. However, the analytics data that was associated with 5678 is aggregated with the rest?
My guess is they use localStorage and send the hash to their servers with each request.
So we are talking about a mechanism that’s just like a cookie.
As long as they don’t have any PII and can’t figure out who the user was, then I think the GDPR gives them an exception.
But the “without cookies” claim is dubious!
Random SHA256 String (daily regenerated)
Day of the year
So basically the only drawback I see is that all employees behind one corporate NAT using the same browser will count as one user.
But I don’t see a way around that if you can only use IP and User-Agent strings.
All of the information you are using in the hash would be in the search query itself: IP, user agent, path, date, etc. So there is no need to reverse the hash - you just hash your search query and compare in O(1) time.
The only piece of information that realistically makes the hash slightly difficult to recreate is the random number refreshed every day. But either you store it (and I have no reason to believe you do not), or it makes the brute force effort trivial, as I only need to generate the hash with that one variable now.
And yes the daily hash gets stored until midnight. But what are you talking about with 'search query' containing IP, user agent etc.?
Also, I suggested you store the daily hash forever. But even if you really erase it every day, as you say, if you or an attacker makes the same request every day at a predetermined time, then whoever gets the logs can use that predictable request to recover the daily secret too.
I consider the information to be stored in plain text, and that you would have to have requested permission just the same. You pretty much have an identifiable user (via IP/UA/access time) stored in your logs.
Anonymization is removal of information, not encoding it in a convoluted hash.
And a hacker could indeed "win" if they broke into our system, got the salt and exported the DB. We didn't focus on this in our article, as it's unbelievably unrealistic, but it's still possible. Our next step is to address that.
Without the hash, it's practically impossible to brute force.
My point about brute forcing being useless is that you hold all the information needed to re-create the hash - all but one tiny piece, the random number. So brute force is a very effective O(&lt;size of that tiny piece&gt;). And since it is stored in your locally available data, there are no rate constraints.
Under your logic, you would never trust us, because we could just add $log->write(UserIp, UserAgent, Hostname, Path) in plain text. Trust is very important, and what you do with the data is what matters under GDPR.
And we don't hold all the information to re-create the hash, that's the thing.
I thought a lot about "Oh but you could just do this, this and this" but, no, that argument doesn't hold. Our obligation under GDPR is what we actually do with data.
Yes you can, particularly if you correlate across different websites.
So if you take a look at Recital 26 (https://gdpr-info.eu/recitals/no-26/):
> To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.
> To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.
> The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.
> This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.
So the piece about the principles of data protection not applying to personal data rendered anonymous is crucial. We believe that GDPR does not apply to us because of that. But even if GDPR did apply to us (we'll assume it does, that's always the best way to be), then our legal basis is that there's legitimate interest. As a website owner, it is in your legitimate business interest to understand how your website is performing - e.g. the most popular pages, the pages where people linger for longer, the pages where people bounce.
‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;
- a hash falls under this. You cannot just quote Recital 26 and stop reading because you found it fitting. Recital 30 covers the case of "other identifiers" that might replace cookies. No hard feelings, we all do it.
The data might be anonymous to a third party, but if you can single out just one person (in other words, one unique visitor), it is not anonymous. NB: one IP poisons the whole dataset.
So your fallback is Article 6(1)(f), which is reasonable, but you cannot assume the interest of a site owner is always higher than the interest of the visitors. You have to put your arguments into writing and provide a means for people to appeal.
6(1)(f) is not meant as a blank cheque or a batch job...
The thing is, you can't create profiles. So right now I could give you a single entry for our website
> NULL "" "https://www.usefathom.com" "bb9377f4cf33093765835a48e962a5dbd3168499abd12b120c8c118c86c41479"
How could we possibly use that to profile / identify? The hash (bb9377f4cf33093765835a48e962a5dbd3168499abd12b120c8c118c86c41479) is unique in the database table and never repeats.
I hear you. We don't rely on Recital 26 to comply with GDPR. I've not had the Recital 26 piece confirmed by a lawyer; it's a personal hunch / exploration. Your comments on Recital 30 are helpful, thank you. I'd like to hear your thoughts on my reply if possible :)