How we built a GDPR-compliant website analytics platform without using cookies

pauljarvis · on July 22, 2019

We are incredibly open to any ideas, comments or concerns on how we're doing this. This is a big step up from what we had previously, but there’s always room for improvement. Happy to hear thoughts in the comments.

lmkg · on July 22, 2019

Hi Paul, thanks for being open about this. I have a big, important question.

ICO, the agency in charge of enforcing GDPR and related legislation in the UK, released guidance earlier this month on the topic of cookies. One of the most notable parts of this guidance is that "device fingerprinting" is treated the same as a cookie[1], and that website analytics requires consent to use cookies or similar technologies[2] ("similar technologies" including device fingerprinting).

Now, the above guidance is related to PECR rather than GDPR, which is what your post is about. But, given the above, do you think that your software is compliant/exempt from PECR or do you think that organizations will still have to take extra steps to be compliant with privacy legislation?

[1] https://ico.org.uk/for-organisations/guide-to-pecr/guidance-...

[2] https://ico.org.uk/for-organisations/guide-to-pecr/guidance-...

JackWritesCode · on July 22, 2019

Totally, so we feel we follow the spirit of the PECR law, but until there's a case against it, we don't have precedent. But we feel like if analytics was under fire we'd be at the bottom of the list because we've gone out of our way to follow the spirit of it.

We don't consider ourselves to be building any sort of 'server side cookie', especially since an anonymous hash is only ever tied to one piece of data and is actually set to null as soon as another request comes in. Unlike cookies, data doesn't follow the user around as they browse the site. A cookie would stick with you as you browse a website.

We've spoken with a few lawyers about this and there's too much grey area at the moment. Time will tell and we're hoping that the UK (my home country) sort PECR out.

orra · on July 22, 2019

I think that's a fair question.

AFAICT, v1 of PECR awkwardly applies whenever the cookie is not functionally directly necessary for the service that the user is using. PECR applies even if, like here, the cookie is just for counting unique numbers of visitors, and is not used for fingerprinting individuals.

The draft v2 of PECR contains an exemption for first party analytics. I think this maybe strikes a nice balance: explicit consent would still be required for the more-harmful third party analytics.

Not sure when v2 of PECR will happen. It is years overdue. Perhaps it is a priority for the newly elected European Parliament and the new Commission?

krageon · on July 23, 2019

What makes analytics first-party? When the first party serves them, or only if the data never leaves machines under the direct and exclusive control of the first party?

orra · on July 24, 2019

I don't know the answer to that.

FWIW, my guess would be that the definition is fairly strict.

Now, I don't think that would prevent the first party using data processors, but I suspect the first party would have to exercise a lot of control over the processor. This would be in contrast to a service like Google Analytics, where the company's control and choice is simply limited to take it or leave it.

krageon · on July 29, 2019

A data processor agreement is not usually negotiated all that hard and this is indeed not really possible with truly large companies (and let's face it, that's where most companies get their enterprise software). Therefore I feel that expecting a lot of control being exercised is a bit of a pipe dream.

atonse · on July 22, 2019

I trust that there are enough questions and scrutiny about your anonymization, so I don't have any questions about that. Mine is more about implementation.

If I want to integrate this with a single page app (like Ember or React), are there enough API hooks on how I can track click events and page load events, etc in the JS? We threw together Google Analytics for our launch just so we would have SOME data, but we want to move away from it ASAP for privacy reasons.

Any more on how it would compare to something like Piwik (the product we're looking at)?

JackWritesCode · on July 22, 2019

We <3 EmberJS. Fathom's Dashboard is built using it. You'd want to build it into the Router and call fathom('trackPageview') to log it :)
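A rough sketch of that wiring, assuming a router object that emits a routeDidChange event (as Ember's router service does) and a fathom function already defined by the tracker script. The helper names installPageviewTracking and createFakeRouter are hypothetical and only exist for this illustration.

```javascript
// Hypothetical helper: call fathom('trackPageview') after every route change.
// `router` is assumed to expose on(eventName, callback), like Ember's
// router service; `fathom` is assumed to be the function the tracker defines.
function installPageviewTracking(router, fathom) {
  router.on("routeDidChange", () => {
    fathom("trackPageview");
  });
}

// Minimal stand-in router so the sketch runs outside a browser.
function createFakeRouter() {
  const handlers = {};
  return {
    on(name, callback) {
      (handlers[name] = handlers[name] || []).push(callback);
    },
    trigger(name) {
      (handlers[name] || []).forEach((callback) => callback());
    },
  };
}
```

In a real app the router would come from the framework rather than the stand-in, and the call would be made once at startup.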

abtestdev · on July 22, 2019

How do you guard against hash collision?

JackWritesCode · on July 22, 2019

https://crypto.stackexchange.com/a/47810

abtestdev · on July 22, 2019

Based on the blog - anyone who shares an IP address (such as inside a company network) would effectively look the same.

proaralyst · on July 22, 2019

And given most companies run managed browsers on managed systems, the user agent is going to collide too

_-___________-_ · on July 22, 2019

Not just inside a company network. CGNAT is pretty commonplace nowadays.
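To make the collision concrete, here is a small sketch. The exact fields and separator Fathom hashes are not spelled out in this thread, so the visitorHash function below is an assumption for illustration only.

```javascript
const { createHash } = require("node:crypto");

// Assumed signature inputs; the field list and separator are illustrative.
function visitorHash(salt, ip, userAgent, hostname) {
  return createHash("sha256")
    .update([salt, ip, userAgent, hostname].join("|"))
    .digest("hex");
}

const salt = "example-daily-salt";
const ua = "Mozilla/5.0 (managed corporate build)";

// Two different people behind the same NAT address with the same browser build.
const alice = visitorHash(salt, "203.0.113.7", ua, "example.com");
const bob = visitorHash(salt, "203.0.113.7", ua, "example.com");
// A third person on a different public address.
const carol = visitorHash(salt, "203.0.113.8", ua, "example.com");

console.log(alice === bob);   // true: indistinguishable, counted as one visitor
console.log(alice === carol); // false: a different public address
```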

JackWritesCode · on July 22, 2019

Indeed. This is one of the limitations to our approach. We're thinking of ways to work around it. Please shout if you have any ideas ;)

cameronbrown · on July 23, 2019

If/When IPv6 is widely used the problem should fade away.

unilynx · on July 22, 2019

Why all the trouble with hashes - can't you just do it on the client and not have to store any data at all?

"For tracking unique page views"

  if(!sessionStorage[location.href]) {
    sessionStorage[location.href]=1;
    navigator.sendBeacon("/unique-pagehit?" + encodeURIComponent(location.href));
  }

"For tracking unique site views"

  if(!sessionStorage["Hi!"]) {
    sessionStorage["Hi!"]=1;    
    navigator.sendBeacon("/unique-sitehit");
  }

"For tracking previous requests"

I'm not sure I fully understand what is being measured (is it session-only?). For the duration someone watched a page, you can use sendBeacon in onBeforeUnload. To detect a bounce, set a Math.random() in a session variable, send it at the start of the page, and have every page load send the previously stored random variable. Then count the unique random keys you received on the server - those are the bounces.

I know, in practice you'll need to trim sessionStorage, sanitize URLs, use something less-colliding than Math.random, deal with new tabs, some polyfills and other robustness, etc... but I don't yet see why the tracking mentioned needs any user ids or hashing at all.
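The server side of the bounce idea could be sketched as below. This is one reading of it, assuming every page load reports the random key stored for the session, so a key that arrives exactly once belongs to a visit that never went past its first page. The function name countBounces is hypothetical.

```javascript
// hits: array of session keys, one entry per page load reported by the client.
// A key seen exactly once means the session loaded a single page (a bounce).
function countBounces(hits) {
  const seen = new Map();
  for (const key of hits) {
    seen.set(key, (seen.get(key) || 0) + 1);
  }
  let bounces = 0;
  for (const count of seen.values()) {
    if (count === 1) bounces += 1;
  }
  return bounces;
}

// Sessions "a" and "c" loaded one page each; session "b" loaded three.
console.log(countBounces(["a", "b", "b", "c", "b"])); // 2
```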

JackWritesCode · on July 22, 2019

Great idea but we can't use sessionStorage under PECR, which is why we made this move. Plus we got rid of anything being stored on the user's machine.

ricardobeat · on July 22, 2019

This client-based scheme would be totally vulnerable to spamming / general messing about. Doing it in the server requires an actual request, and should still allow them to filter out attempts at gaming the system.

moose333 · on July 22, 2019

As a user of the open source version of Fathom, I'm a little concerned by the lag in publishing this update to the community edition. I assumed development work was happening in the open on GitHub, but I guess that's not the case?

JackWritesCode · on July 22, 2019

Whole new language & codebase. Old developer left, we don't write Go.

pauljarvis · on July 22, 2019

So it's not as easy as just pushing the update to the repo. We are still committed to open-source, but we also have a business to run and need to make a living here (we're two dudes who care about privacy, not a huge company with deep pockets) :)

The community version is getting a full update soon. We just have to focus on profit a bit (this keeps us in business and able to update the repo).

bluesmoon · on July 22, 2019

Speaking as someone who built an analytics business while still maintaining an open source version of the code, I can confirm that it's hard. Sometimes it starts off as one small patch that only makes sense for the hosted service, but you can't push to the OS branch until the code is refactored to protect secrets.

I cannot give you any tips on managing your time between the two, but you may want to consider raising your prices. Back in 2011, we ran our basic account at $50/mo, Business at $150/mo and Enterprise at $CallMe/mo. We could probably have upped that after a year with all the new features we'd put in, but we were acquired before the 2y mark and dropped the first 2 plans.

We still maintain the open source version of the code, but neither of the original founders work on it (for me it's gone back to being a hobby because there are now people paid full time to work on it). We still get questions about the lag in updates. We typically do quarterly bulk pushes to the open source version now.

JackWritesCode · on July 22, 2019

Thanks for sharing your story. It is hard and we have to fight to not resent OSS because of 0.001% of the OS community. Open-source software has contributed significantly to our lives, and we love it, so we're going to be pursuing OS Version 2 regardless of a few angry people.

gigatexal · on July 22, 2019

Armchair CEO here but charge a fair rate and if that means double do it. People will respect it and pay it if the product or service is helping them make money and I’d rather pay a higher amount and know I’m helping fund a tool that I use to help me be productive than get something cheap.

joekrill · on July 22, 2019

There's certainly nothing wrong with that. But this sounds a little defensive. And if you're going to promote the "Community Edition"/open-source nature of the codebase, this should probably be made more clear to your users/contributors. (Maybe it is made clear and I'm just not seeing it, though - this is my first time hearing about this product).

JackWritesCode · on July 22, 2019

I think the defensiveness comes when people put comments such as "I'm a little concerned by the lag in publishing this update". I've never spoken like this in my life to an OS contributor and I use open-source software every day. We're working hard to get the software OS but we were originally planning on keeping the codebases separate (since they were different languages). We only recently pivoted on this after speaking to some wonderful people in the OSS community. Perhaps our communication has been poor, so we'll work on that, but we are working hard at this.

tedivm · on July 22, 2019

I'm a big fan of fathom and have been using it for my personal sites for about six months now- I've completely ditched Google Analytics and other providers.

When I heard that the new version wasn't going to be open source it was disappointing- both for the simple fact that I value open source, and also for how it wasn't really announced so much as heavily implied until I actually asked. Knowing that you've listened to the community and are going to open source the new version as well is a huge relief for me, and will definitely result in me promoting your product again.

I would say from a communication standpoint things could have been clearer, and I'm sure you'll work on that. From a timeline standpoint I think it would be nice if the open source and hosted versions eventually were released together, but with a full rewrite I also understand that you probably want to clean it up and make sure it's in good shape before doing so. That being said if you want some help with alpha testing the open source version I'd be happy to assist, and I'm sure others would as well.

JackWritesCode · on July 22, 2019

You are the side of open-source that the world needs. We were going to keep the current repository as-is because it's written in Go, but we'll introduce a new one. Upon announcing that V2 was paid only, people were angry (understandably). We spoke with the community and the consensus was that everyone would prefer that we 'archive' V1 (Golang) and then open-source V2 moving forward. Sorry for the bad communication, it's been challenging with the move between languages!

tedivm · on July 22, 2019

That makes a ton of sense- you'll be able to get a lot more community support from the PHP community than the Golang one, just due to the nature of its size (and the fact that PHP is still very much a web language, which targets your audience well). It's been a few years since I was really involved in that community, but I still maintain a few open source projects and keep up with the language- who knows, maybe I'll be able to throw some bug fixes your way.

scarejunba · on July 22, 2019

You know, I do wonder why people talk like that these days. I cannot recall a time when it was acceptable and I certainly wouldn't do it myself. But these snide comments are so common on OSS these days.

nullstyle · on July 22, 2019

https://www.youtube.com/watch?v=nUBtKNzoKZ4

jdlshore · on July 23, 2019

Speaking as a paying customer of Fathom [1], thank you for your hard work. I know how hard it is to spin up a new business. I for one don't care whether your code is open source or not... I'm paying for the service and the code has no value to me. Please spend your limited time and money on building your business. I don't want the distraction of having to find another analytics platform. ;-)

[1] https://www.agilefluency.org

sjogress · on July 22, 2019

Out of curiosity, which language did you choose to write the new code base?

JackWritesCode · on July 22, 2019

We are writing Version 2 in PHP 7.3 with Laravel. We're pleased with its performance (our hosted solution operates at large scale using Laravel) and it's our favourite language to work with. We'll be building the frontend in EmberJS and it will be incredibly fast.

atonse · on July 22, 2019

Awesome that you're using Ember! I'm hoping that (from my earlier question) I'll be easily able to keep track of page load/unload events using route transitions if I'm using this product.

ares2012 · on July 22, 2019

This is a common solution to the problem of PII, but without any information on returning users I would argue that its value as an analytics platform is limited. Few are the tools where you can grow the business without knowing the difference between a first-time and return user, which is the reason cookies were invented in the first place.

However, since such businesses already need to collect personal info as part of your account creation it shouldn't be hard to build analytics on top of that existing PII. If they are already collecting PII it doesn't seem to save much to have their analytics tool avoid it?

JackWritesCode · on July 22, 2019

Fathom Analytics is intentionally limited, and the limitations you point out are 100% intentional. There are many businesses who can't use our product, but millions that can :)

AndrewStephens · on July 22, 2019

Most schemes of this kind are just more complicated cookies that people hope will avoid the GDPR provisions by dint of being obfuscated.

What the article is discussing looks (at first brush) to be a sensible way of aggregating users up-front before it hits the database, rather than later. So no personal data is stored.

Does this meet the requirements for a site to avoid notifying users under the GDPR? I have no idea.

Even with the best of intentions, if you use a service like this then you are relying on them a) doing what they claim, and b) not screwing up (by leaving logs around, etc).

If I use this service and data from my users gets leaked by Fathom, who gets blamed? The users were on my site, so I guess it is I that gets fined. Maybe the risk is worth it, maybe it isn't.

robgough · on July 22, 2019

In response to your final question, this[1] document from the UK's ICO has some interesting info. Essentially you're either a Data Controller (that would be your site in this example) or a Data Processor (Fathom, in this case -- probably?!).

"64. The ICO cannot even take action directly against a processor who is entirely responsible for a data breach, for example by failing to deliver the security standards the controller has required it to put into place. However, in these cases the ICO may decide not to take any enforcement action against the controller if it believes it has done all it can to protect the personal data it is responsible for and to ensure the reliability of its processor, for example through a written contract. However, whilst the ICO cannot take action against the processor, the data controller could take its own civil action against its data processor, for example for breach of contract."

Though it goes on to say that in some circumstances, the processor can _become_ a controller, in which case the ICO can go after it.

[1]: https://ico.org.uk/media/for-organisations/documents/1546/da...

JackWritesCode · on July 22, 2019

And even if there was a data breach where myself & Paul were held at gunpoint and told to export the database, there's no personal data to do anything with. Not even in our Redis queue! Our database is very boring, anonymous and simple, and we like it that way.

michaelbuckbee · on July 22, 2019

The GDPR (and other data privacy legislation) uses the concept of a "data controller" and a "data processor". Data controllers use a variety of data processors to deliver a service to their users.

If you ran a SAAS and used Fathom analytics, a SMTP email provider (to send password resets), a newsletter provider (for your monthly newsletter), a blog host etc. each of those would be data processors as you (the SAAS) are making the determination of where user data is going on the backend.

As a data controller, it's your responsibility to make sure that each of those services you are using is handling the data in an appropriate and safe way.

As to who gets fined if there is a data breach: the answer is likely nobody. I say this because it's not like you have a credit card on file and breach automatically means a fine. What really matters is what actions you're taking before and after a breach.

- As a data controller did you notify your users with no undue delay?

- As a data processor did you notify the SAAS with no undue delay?

- Did you identify the source of the breach? Did you take steps to remedy it?

For almost all these privacy regulations if you:

1. Take steps to protect user data/privacy like https, encryption etc.

2. Provide a mechanism to allow users to make data requests for their own identifying data

3. Notify users if there's a data breach

You would be in compliance.

munchbunny · on July 22, 2019

> Does this meet the requirements for a site to avoid notifying users under the GDPR? I have no idea.

Not necessarily. It's a bit complex, but the fact that Fathom ingests personal data at all likely means that they must still be disclosed by whoever is using Fathom's code.

On the other hand, if Fathom were able to push data ingestion into the first party's infrastructure so that only aggregated data hits Fathom's own infrastructure, I'm pretty sure that would put Fathom in the clear due to the fact that they are no longer processing or storing personal data.

To Fathom's credit, I am a big fan regardless of the steps they seem to be taking with anonymization, and I will consider using them instead of GA for future projects.

JackWritesCode · on July 22, 2019

What do you think about Recital 26? For GDPR, our stance can be found here: https://usefathom.com/data/

munchbunny · on July 22, 2019

I am neither a lawyer nor a DPO, I just worked on GDPR compliance in the past, so take what I say with a grain of salt.

I went and reread parts of GDPR, and I think you are right, though not because of Recital 26. I think it's pretty clear that after your ingestion the data has been in good faith anonymized to the degree that it is no longer personal, and therefore your analytics code should be exempt from consent rules.

The interesting question to me is whether a controller deciding to put your pixel on a page for analytics purposes counts as you processing on the controller's behalf to an extent requiring consent. I don't see any clause specifically regarding third party access to personal data (as opposed to third parties processing personal data), so I agree with your stance that it's most likely fine.

M2Ys4U · on July 22, 2019

(obligatory IANAL) Recitals aren't legally binding in and of themselves, so I'd be wary of _relying_ on them _completely_.

That said, the articles of a regulation should be read in light of the recitals (as the latter are the rationale behind the actual binding legislation).

JackWritesCode · on July 22, 2019

Trust me, we're not relying on Recital 26, that is just my thought experiment :) See here for our GDPR compliance: https://usefathom.com/data/

labawi · on July 25, 2019

If visits expire after 30 min, why not rotate the salt every 30 min? Keep current and previous salt, update as needed.

I would have more faith in privacy, if you didn't store the salt in the DB or permanent storage. If you manage to statically load-balance the users (e.g. hash site, ip, user-agent, don't forget site), the hash could be in-memory only. Sessions would break on server restart, but that's more of a feature.

To move things further, you might not even need to store the hashes in the DB. Keep them in server memory only and (real-time) update aggregate data in DB.
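A minimal sketch of the rotation I have in mind, assuming a single process that keeps only the current and previous salt in memory. The class name RotatingSalt, the field list and the separator are all made up for illustration.

```javascript
const { randomBytes, createHash } = require("node:crypto");

class RotatingSalt {
  constructor(intervalMs, now = Date.now()) {
    this.intervalMs = intervalMs;
    this.current = randomBytes(32).toString("hex");
    this.previous = null;
    this.rotatedAt = now;
  }

  // Rotate once the interval has elapsed. Salts older than the previous one
  // are simply dropped, so hashes made with them can no longer be recomputed.
  tick(now = Date.now()) {
    const elapsed = now - this.rotatedAt;
    if (elapsed < this.intervalMs) return;
    // If more than two intervals went by, even the old "current" salt is stale.
    this.previous = elapsed < 2 * this.intervalMs ? this.current : null;
    this.current = randomBytes(32).toString("hex");
    this.rotatedAt = now;
  }

  // Hash under every live salt, so a visit that started just before the
  // last rotation can still be matched against its earlier signature.
  signatures(ip, userAgent, hostname) {
    return [this.current, this.previous]
      .filter((salt) => salt !== null)
      .map((salt) =>
        createHash("sha256")
          .update([salt, hostname, ip, userAgent].join("|"))
          .digest("hex")
      );
  }
}
```

With a 30 minute interval, the window in which a leaked salt is useful shrinks from a day to at most an hour, and a restart wipes both salts.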

JackWritesCode · on July 26, 2019

The visit expires 30 minutes after the visitor lands. The expiration isn't generic.

It's an interesting idea. We have multiple servers under the load balancers, so we'd be able to store them in Redis, but that is no better than permanent storage, since Redis could still be breached and you'd see it with ease.

i_anon · on July 23, 2019

Hi Jack and Paul - love what you're doing! This solution is so needed.

I wondered whether you could explain what makes your hashing different from the hashing used by Facebook for their custom audiences tool which was deemed unsuitable for anonymisation as per https://www.spiritlegal.com/en/news/details/e-commerce-retai...

mrweasel · on July 22, 2019

Couldn’t people just parse the log files from their webservers?

bashy · on July 22, 2019

Definitely. Helps with people who disable JS as well. I've used this before https://goaccess.io
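For anyone who wants to roll their own instead, a minimal sketch of counting page views from access logs. It assumes the common "combined" log format used by Apache and nginx; the function name countPageViews is hypothetical, and a real setup would also filter bots and static assets.

```javascript
// Matches: ip ident user [time] "METHOD path protocol" status ...
const LOG_LINE = /^(\S+) \S+ \S+ \[([^\]]+)\] "(\S+) (\S+) [^"]*" (\d{3}) /;

// Counts successful GET requests per path.
function countPageViews(lines) {
  const views = new Map();
  for (const line of lines) {
    const match = LOG_LINE.exec(line);
    if (!match) continue; // skip lines that are not in the expected format
    const [, , , method, path, status] = match;
    if (method !== "GET" || status !== "200") continue;
    views.set(path, (views.get(path) || 0) + 1);
  }
  return views;
}

const sample = [
  '203.0.113.7 - - [22/Jul/2019:10:00:00 +0000] "GET /blog HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
  '203.0.113.8 - - [22/Jul/2019:10:00:05 +0000] "GET /blog HTTP/1.1" 200 5120 "-" "Mozilla/5.0"',
  '203.0.113.8 - - [22/Jul/2019:10:00:09 +0000] "POST /login HTTP/1.1" 302 0 "-" "Mozilla/5.0"',
];
console.log(countPageViews(sample).get("/blog")); // 2
```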

JackWritesCode · on July 22, 2019

bashy from laracasts? Hey there! And I've heard the disabled JS piece before. How many people are disabling javascript?

bashy · on July 24, 2019

Yes that's me. Hi there.

I use fathom on one of my sites and I like it. I was just replying to someone who said people could just use access logs instead. Some don't like adding JS to their site I guess.

Ad blockers will also block requests to the tracker.js if you use analytics.domain for example.

SCLeo · on July 22, 2019

Looking at their live demo (https://stats.usefathom.com/#!p=1w&g=hour), I can see a lot of traffic is coming from ycombinator. So...

(I mean, I don't have a point here but I find it pretty interesting. xD)

hedora · on July 23, 2019

Duck Duck Go traffic is 10% of the Google traffic.

Is that typical for privacy-focused tech sites? I would have expected a lower percentage.

(DDG fan here, but people look at me funny when they see me using it...)

ricardobeat · on July 22, 2019

That's exactly what they are selling - the ability to get relevant business metrics while preserving user privacy.

billabul · on July 22, 2019

sorry but, isn't that a (unnecessarily complex) cookie?

Spivak · on July 22, 2019

Look, entire industries exist for being compliant with the letter but not the spirit of the law so I'm sure that this in no way meets the definition of a cookie as far as the GDPR is concerned.

However, this is absolutely a cookie. Scraping just enough information from the browser to create a unique but stable hash and then having the browser compute it every request isn't at all different from that browser information acting as the cookie.

Nasrudith · on July 22, 2019

Technically no - cookies are stored by the user. If anything it is technically worse in some ways because it moves it to the back end outside of user control.

JackWritesCode · on July 22, 2019

We don't store a cookie but, yes, the hash doesn't need the date item in it since it has the salt. We'll change that in the next release

vmlpvf · on July 22, 2019

The data is not anonymous. Anonymity is actually very hard to claim (read k-anonymity, differential privacy, etc).

Nevertheless, the chances of identifying someone are probably pretty low, and it's a good effort to make analytics more privacy friendly.

JackWritesCode · on July 22, 2019

So it's practically anonymous. Nobody has enough computing power to brute it & the data is deleted, typically, in 30 minutes.

nybble41 · on July 22, 2019

To de-anonymize the data you don't actually need to brute force the 256-bit hash. If the other pieces of data are known (salt, site, page, day of year) and you can make a shrewd guess at the user agent then you'd only need to brute force the 32-bit IP address(*).

(*) Assuming IPv4. Obviously IPv6 addresses would be much harder to brute force, but still easier than a 256-bit hash.
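A small sketch of that attack, assuming the hash is SHA-256 over the salt, hostname, IP and user agent joined with a separator (the real field order is not given here, so this layout is a guess). To keep it quick it only walks one /16 block; the full IPv4 space is the same loop with two more levels.

```javascript
const { createHash } = require("node:crypto");

// Assumed signature layout, for illustration only.
function signature(salt, ip, userAgent, hostname) {
  return createHash("sha256")
    .update([salt, hostname, ip, userAgent].join("|"))
    .digest("hex");
}

// Try every address in prefix.0.0 through prefix.255.255 (65,536 candidates).
function recoverIp(target, salt, userAgent, hostname, prefix) {
  for (let c = 0; c < 256; c++) {
    for (let d = 0; d < 256; d++) {
      const ip = `${prefix}.${c}.${d}`;
      if (signature(salt, ip, userAgent, hostname) === target) return ip;
    }
  }
  return null;
}

const ua = "Mozilla/5.0 (guessed common user agent)";
const leaked = signature("leaked-salt", "198.51.100.23", ua, "example.com");
console.log(recoverIp(leaked, "leaked-salt", ua, "example.com", "198.51"));
// 198.51.100.23
```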

JackWritesCode · on July 22, 2019

I do take your point but when we get into this area, it becomes a big question of trust. Because if I gave you a hash, you would need to guess the 256 bit salt in addition to all the other possibilities.

I mean, hey, here's a hash: 26246226167b9f190d3a1ce726efe07ae18bbf0480a78d19390b9aaf13f25cb0

Imagine you just got hold of it through a data breach. I'll give you $1,000 if you can get it de-hashed before midnight Chicago time ;)

nybble41 · on July 22, 2019

It's true that you'd need to know the salt first, but the server does know the salt, and if you obtained the hash through a data breach then you probably obtained the current day's salt as well.

JackWritesCode · on July 22, 2019

So if someone had unlimited access to our servers, that would be a problem. One piece to note is that page views get deleted after around 30 minutes.

So one of the reasons we posted to HN was for conversations like this. Reading what you put makes me think we need to do more when creating the PageRequestSignature and the SiteRequestSignature, because if someone had access to our server, got the hash, stopped all cron jobs from processing data, then they could work it out after a huge amount of time / computing power. But to be honest, at this point, they could also add log($_SERVER) and get the entire request body of a user, so we would have much bigger problems in that scenario.

Anyway, you've given me a few new thoughts on hardening from data breaches. Because, yes, if they know the hash & have full access to our server then it becomes easier. So we almost need to move it to the point where they won't know the salt that was used for a particular user. So we'd not recycle a single salt, we'd recycle multiple salts that are based on, perhaps, the first 2-3 digits of a user's IP address combined with the last 2-3 digits.... Then it would be much harder to break without first knowing the $_SERVER dump (the user's IP etc.). Obviously this wouldn't stop a "complete control of server" attack where they could just start logging everything but it would really ruin a brute forcer's day because they wouldn't know what salt to start with.

What do you think of that idea? I'm running on little sleep so be nice ;) Also thanks so much for all your feedback so far, it's so appreciated!

nybble41 · on July 23, 2019

The case I would be (slightly) concerned about would be where an attacker has obtained limited, read-only access to your database, so they have the current salt and the hashes related to the last 30 minutes of activity but not the ability to simply log whatever data they want. At that point brute-forcing the 32-bit address space and a small set of common user agent strings would allow them to determine the IP addresses for specific page views.

Of course, if an attacker has access to the salt and is interested in the pages viewed by a specific known IP address then this all becomes much simpler. Then the only unknown is the user agent, which is relatively low-entropy.

> So we'd not recycle a single salt, we'd recycle multiple salts that are based on, perhaps, the first 2-3 digits of a users IP address combined with the last 2-3 digits....

If an attacker is already brute forcing the IP address and has the ability to obtain the salt then they would just use the correct salt for each IP address. Making the salt depend on other data already included in the hash doesn't change the size of the search space.
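To illustrate, here is a sketch with a made-up scheme of one salt per first octet (the saltTable, saltFor and attack names are hypothetical). The attacker still hashes each candidate address exactly once; they just look up the matching salt on the way.

```javascript
const { createHash, randomBytes } = require("node:crypto");

const sha256 = (text) => createHash("sha256").update(text).digest("hex");

// Hypothetical scheme: one salt per first octet of the IPv4 address.
const saltTable = Array.from({ length: 256 }, () =>
  randomBytes(32).toString("hex")
);
const saltFor = (ip) => saltTable[Number(ip.split(".")[0])];

const signature = (ip, userAgent) =>
  sha256(`${saltFor(ip)}|${ip}|${userAgent}`);

// An attacker holding the leaked salt table tries each candidate once,
// picking the salt that belongs to that candidate as they go.
function attack(target, candidates, userAgent) {
  let attempts = 0;
  for (const ip of candidates) {
    attempts += 1;
    if (signature(ip, userAgent) === target) return { ip, attempts };
  }
  return { ip: null, attempts };
}

const candidates = ["10.0.0.1", "172.16.0.9", "203.0.113.7"];
const result = attack(signature("203.0.113.7", "UA"), candidates, "UA");
console.log(result.ip, result.attempts); // 203.0.113.7 3
```

The attempt count never exceeds the number of candidate addresses, which is the same bound as with a single shared salt.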

JackWritesCode · on July 24, 2019

I think what’s been helpful for us with posting this here is to hear of different ideas for how someone might hack Fathom. When we came up with it, our starting point wasn’t “you have the salt and IP address”, go break a hash. It was on the assumption that you don’t have the salt or IP. I think we can improve what we’ve built. 720 salts improves resilience in a few areas but not in the scenario you are painting here. The scenario you’re painting here has made me think of additional ideas though.

If they had the salt, IP and user agent, they’d have to also brute force every possible hostname and pathname, which would be insane. But I suppose they’d only have to do a few million based on the data we have on hostnames / pathnames...

Lots of ideas for improvement are popping into my head and I love how this community keeps challenging you to improve things. We had feedback on Reddit but it was much angrier!

The next step is to take your feedback and look into how we would defend against the scenario you’ve provided. Thank you!

tomp · on July 22, 2019

Maybe I'm missing something but (1) I don't think this is GDPR compliant, and (2) why so complicated?

Regarding (1),

> Brute forcing a 256 bit hash would cost 10^44 times the Gross World Product (GWP). [...]

> We have rendered the data anonymous to the point where we could not identify a natural person from the hash.

> It's possible that GDPR does not apply to Fathom since data is made completely anonymous. Even if GDPR did still apply, we reiterate the stance that there is legitimate business interest to understand how your website is performing.

This seems to imply a profound confusion between hashing and anonymity. Just because it's hashed doesn't mean it's anonymous! You don't need to "brute-force" the hash, you just need to find a user that matches your hash... which is 1 in 7 billion (or so), much more tractable. This is also the principle e.g. MD5 rainbow tables are based on...

They claim to change the hash every 24 hours, so it's equivalent to having a session cookie with 24-hour expiration (session cookies are "anonymous" by their definition, they don't have any user information and they're impossible to "brute force", they "just" enable tracking). I've no idea if 24-hour session cookies are GDPR-compliant...

Regarding (2), given that this seems (again, I might be misunderstanding) equivalent to a 24-hour session cookie, why not just do that? However, then you're ... drumroll ... giving control to the user. Why not just give control to the user, period?! For example, by storing the list of pages visited in Local Storage, and only pinging the server once for each page(view) every 24 hours?
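Something like the following sketch, assuming a storage object that behaves like window.localStorage. The names shouldPing and createMemoryStorage are made up for illustration, and the key prefix is arbitrary.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// `storage` is assumed to behave like window.localStorage (getItem/setItem).
// Returns true when the page has not been reported in the last 24 hours,
// and records the current time so later views within that window stay quiet.
function shouldPing(storage, url, now = Date.now()) {
  const key = "seen:" + url;
  const last = Number(storage.getItem(key));
  if (last && now - last < DAY_MS) return false;
  storage.setItem(key, String(now));
  return true;
}

// Map-backed stand-in so the sketch also runs outside a browser.
function createMemoryStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => data.set(key, String(value)),
  };
}
```

In the browser this would be called as shouldPing(localStorage, location.pathname), followed by a navigator.sendBeacon call to whatever endpoint collects the hits.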

JackWritesCode · on July 22, 2019

For GDPR compliance notes please see: https://usefathom.com/data/

> You don't need to "brute-force" the hash, you just need to find a user that matches your hash... which is 1 in 7 billion (or so), much more tractable. This is also the principle e.g. MD5 rainbow tables are based on...

Not quite. We use a SHA256 hash as our salt, and that changes each day, so you'd need to brute force that.

In terms of how many possible combinations there are for this salt, please see: https://stackoverflow.com/a/49520766 - you would need to brute force it and try each possible combination with every single possible IP / User Agent / Site combination to break a hash. This is why it's not theoretically impossible but it's practically impossible.

We would love to approach things in an easier way but PECR doesn't want cookies, even anonymous ones.

Now, one thing that we have uncovered thanks to someone on here is that we need to increase our resistance to data breaches. If someone had complete, unlimited access to all our data / servers, including the daily salt, then they could de-hash page views from the last 30 minutes. I have no idea how long that would take. There are 4,294,967,296 (?) possible IP addresses, and then over 3M (?) user agents, so it'd be an absurd, pointless exercise.... Anyway, we're going to be bringing in multiple salts that depend on the user IP address, meaning that, in the event of a data breach, a hacker won't know which salt has been used for a hash :) Perhaps we base the salts on the first 3 digits of an IP address? That would mean we'd have 720 possible SHA256 salts!
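To put a rough number on "how long that would take", here is a back-of-the-envelope sketch in Python. The two counts are the figures quoted above; the hash rate is a purely hypothetical assumption, not a measurement of any real hardware, and the function names are made up for illustration.

```python
# Size of the search space for one site on one day, assuming the salt is known.
IPV4_ADDRESSES = 2 ** 32      # 4,294,967,296 possible IPv4 addresses
USER_AGENTS = 3_000_000       # "over 3M (?)" user agents, as quoted above

# Hypothetical attacker speed in SHA-256 hashes per second (an assumption).
ASSUMED_HASHES_PER_SECOND = 1_000_000_000


def candidates_per_site() -> int:
    """Number of IP x User-Agent combinations to try."""
    return IPV4_ADDRESSES * USER_AGENTS


def days_to_exhaust(candidates: int, rate: int = ASSUMED_HASHES_PER_SECOND) -> float:
    """Worst-case days needed to try every candidate at the assumed rate."""
    return candidates / rate / 86_400


total = candidates_per_site()
print(f"{total:.3e} candidates, about {days_to_exhaust(total):.0f} days at the assumed rate")
```

That is roughly 1.3 × 10^16 combinations. How long that takes depends entirely on the assumed rate, and an attacker who already knows a target's likely IP range or user agent searches a far smaller space.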

hedora · on July 23, 2019

You can get an order of magnitude on hash collision resistance by rolling every two hours. Maintain “two” backend databases to gracefully track sessions between roll overs.

Also, for non-GDPR IP blocks, maybe just store a per-client salt in a cookie(!) and then XOR it with the rotating server salt.
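A minimal sketch of the XOR suggestion, assuming both salts are random byte strings of equal length. `combine_salts` is a hypothetical helper name, and nothing in this thread says Fathom does this.

```python
import secrets


def combine_salts(client_salt: bytes, server_salt: bytes) -> bytes:
    """XOR a per-client salt with the rotating server salt, byte by byte."""
    if len(client_salt) != len(server_salt):
        raise ValueError("salts must be the same length")
    return bytes(c ^ s for c, s in zip(client_salt, server_salt))


client_salt = secrets.token_bytes(32)   # would live in the visitor's cookie
server_salt = secrets.token_bytes(32)   # would rotate on the server
effective_salt = combine_salts(client_salt, server_salt)
```

The combined value changes whenever either input changes, so clearing the cookie or rotating the server salt both break continuity.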

JackWritesCode · on July 23, 2019

I’m not worried about hash collision with sha256. Any reason why I should be? :)

And we can’t use cookies because of PECR!

saagarjha · on July 22, 2019

> Tracking page views alone, without visits, is completely useless and means that you won’t have insight into how many people visit your site / pages each day.

What's the difference between a page view and a visit?

JackWritesCode · on July 22, 2019

If I came onto your website and refreshed one of your pages 500 times, that would count as 500 page views but only 1 visit.
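As a small illustration of that distinction, here is a Python sketch that counts both numbers from a list of page loads. The `(visitor_hash, path)` row shape and the `summarize` name are assumptions for the example; how Fathom actually aggregates is not shown in this thread.

```python
def summarize(views):
    """Return (page_views, visits) for an iterable of (visitor_hash, path) rows."""
    views = list(views)
    page_views = len(views)                          # every page load counts
    visits = len({visitor for visitor, _ in views})  # each distinct visitor counts once
    return page_views, visits


# One visitor refreshing the same page 500 times.
views = [("hash-abc", "/pricing")] * 500
print(summarize(views))  # (500, 1)
```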

jacquesm · on July 22, 2019

At first glance this appears to be a well thought out solution. It will never be able to give you some of the stuff that GA can give you but that is by design. The problem that I see is that as long as GA is able to claim they are GDPR compliant there will be very few websites that will see this as a necessity and so adoption will be relatively low. But, and this is just an idea, one of the things that company could do is to proudly present a 'zero retention' button or logo assuming they do not have other trackers on their pages. That way it might become a distinguishing factor for the adopters and that might drive further adoption.

Thanks for building this, I will promote it.

JackWritesCode · on July 22, 2019

A fantastic idea. We've recently designed a button for websites to show users that they care but I really love this idea of a 'zero retention' button :)

st3ve445678 · on July 22, 2019

Looks decent, but pricing is insanely high for the extremely limited set of stats.

pauljarvis · on July 22, 2019

Totally fair, as that's an opinion :) Luckily our customers are happy with the price, and other folks use the open-source (100% free) version. Cheers!

paulgb · on July 22, 2019

I actually think your prices are pretty reasonable for business use, but I wonder if you've considered a "personal" plan of, say, $6/month for up to 10k hits, non-commercial use only.

I've been looking for a privacy-first tracker for some personal sites for a while, but nobody is offering pricing that makes sense at the lower end.

cm2012 · on July 23, 2019

The price is equivalent to one hour of developer time per year.

soared · on July 22, 2019

Pricing looks comically cheap for SMB and above.

LeonidBugaev · on July 22, 2019

exactly, expected to be 5-10 times higher.

CHsurfer · on July 22, 2019

I think the GDPR was enacted into law not to prevent cookies, but to prevent collecting data on regular people. This seems to circumvent the technicalities of the law but not the spirit. The risk is that they enact a new law that puts even further restrictions on website operators.

I'm not sure this is a good idea.

pauljarvis · on July 22, 2019

Appreciate the note and thought here. I do disagree though, as it feels like the spirit of GDPR is to make into law the protection and privacy for regular people. Fathom does this to the best of our ability, and our code reflects our agreement with the spirit of the law.

Analytics is required for business and isn't going anywhere. The laws don't feel like they are trying to shut down analytics completely, they are just asking this type of software to do better. That's what I think we are doing with this—and there are no other analytics companies who come close to our level of obfuscation and non-tracking of personal data.

If the intent of the law is to do better with privacy and data, we are doing it to the best of our abilities. We're not skirting around the issue; we are agreeing with it in our code and in the logic of how our tracker works.

JackWritesCode · on July 22, 2019

Thanks for the concern here. We are GDPR compliant (and may be exempt from it). See here: https://usefathom.com/data/

testudovictoria · on July 22, 2019

Tell me if I get this correct:

Alice visits a site and gets the hash 1234. The analytics data is stored and associated with hash 1234, but soon after, hash 1234 is removed. However, the aggregate visitor analytics data that was associated with hash 1234 persists. Then another user (say Alice again) returns and gets hash 5678. Analytics data is tracked, stored with hash 5678 for the 30 minutes (or less), and then hash 5678 is again removed. However, the analytics data that was associated with 5678 is aggregated with the rest?

JackWritesCode · on July 22, 2019

That's exactly how it works. The purpose being to make it completely impossible to ever single out a user and see which pages they viewed on a website.

jfk13 · on July 22, 2019

You might like to edit the line on that policy page that refers to "the most privacy-focused manor"... while a privacy-focused manor is an interesting idea, I suspect you meant "manner". :)

i_anon · on July 23, 2019

Equally, I'm not sure what it means to be GDPR "complaint" but I'm thinking it's probably supposed to be "compliant" ;)

pauljarvis · on July 22, 2019

lolololol

felixfbecker · on July 22, 2019

It's so exciting that thanks to GDPR we are now seeing innovative analytics solutions that respect privacy.

pauljarvis · on July 22, 2019

Agreed!

EGreg · on July 22, 2019

I am not sure what exactly they did here. How do they persist the hash between requests?

My guess is they use localStorage and send the hash to their servers with each request.

So we are talking about a mechanism that’s just like a cookie.

As long as they don’t have any PII and can’t figure out who the user was, then I think the GDPR gives them an exception.

But “without cookies” claim is dubious!

jacquesm · on July 22, 2019

It's a session long affair, rather than a persistent affair, with 30 minutes apparently the arbitrary cut-off for a session.

JackWritesCode · on July 22, 2019

We have been GDPR compliant for many months but our aim here was to meet E-Privacy demands.

We don't use localStorage... Read the blog post, we don't use cookies.

EGreg · on July 22, 2019

Oh, apologies, you are right! You don't use any kind of cookie mechanism. I just read it again:

  Random SHA256 String (daily regenerated)
  IP Address
  User Agent
  Site ID
  Day of the year

My only question is about this "Random SHA256 String", where is it stored between requests?

JackWritesCode · on July 22, 2019

Redis Cache. It's a Fathom-wide random string that is used to prevent rainbow table attacks. The salt is refreshed at midnight every day.
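A minimal Python sketch of how a hash over those five inputs could be computed. The field order, the `|` separator, the UTF-8 encoding and both function names are assumptions for illustration; the thread only lists the inputs and says the salt is regenerated daily and kept in Redis (storage is not modeled here).

```python
import hashlib
import secrets
from datetime import date


def new_daily_salt() -> str:
    """Random 64-hex-character string standing in for the daily 'Random SHA256 String'."""
    return hashlib.sha256(secrets.token_bytes(32)).hexdigest()


def visitor_hash(salt: str, ip: str, user_agent: str, site_id: str, day: date) -> str:
    """SHA-256 over salt, IP, user agent, site ID and day of the year."""
    day_of_year = day.timetuple().tm_yday
    material = "|".join([salt, ip, user_agent, site_id, str(day_of_year)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


today = date(2019, 7, 22)
salt = new_daily_salt()
same_a = visitor_hash(salt, "203.0.113.7", "ExampleBrowser/1.0", "site-1", today)
same_b = visitor_hash(salt, "203.0.113.7", "ExampleBrowser/1.0", "site-1", today)
rotated = visitor_hash(new_daily_salt(), "203.0.113.7", "ExampleBrowser/1.0", "site-1", today)
print(same_a == same_b, same_a == rotated)
```

With the same salt, identical inputs always produce the same hash, which is what lets repeat views within a day be recognized. Once the salt is regenerated, the same visitor maps to an unrelated value.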

EGreg · on July 22, 2019

Thanks, makes total sense.

So basically the only drawback I see is that all employees from behind one corporate NAT using same browser will count as one user.

But don’t see a way around that if you can only use IP and User Agent strings.

itronitron · on July 22, 2019

why are you calling it an analytics platform when it isn't one?

gcbw2 · on July 22, 2019

This is still logging everything the GDPR says you can't without asking for consent, but you made your search convoluted (but not less efficient if you have all the pieces) to (suggest|lie?) that you need to break the hash and that's why you don't need consent.

All of the information you are using in the hash would be in the search query itself: IP, user agent, path, date, etc. So there is no need to reverse the hash. You just hash your search query and compare in O(1) time.

The only piece of information that realistically makes the hash slightly difficult to get is the random number refreshed every day. But either you store it (and I have no reason to believe you do not) or it makes the brute force effort trivial, as I only need to generate the hash with that variable now.

JackWritesCode · on July 22, 2019

You're focused mostly on Recital 26, which was only a theory of mine, outside of that we are GDPR compliant anyway. I likely shouldn't have included it since that isn't our primary ground for processing. Please see: https://usefathom.com/data/

And yes the daily hash gets stored until midnight. But what are you talking about with 'search query' containing IP, user agent etc.?

gcbw2 · on July 24, 2019

If a search query on your data would contain all the components of the original hash, i don't have to walk backwards and break the hash. i just have to hash my query terms in the same way.
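The argument can be sketched in a few lines of Python. This assumes the hash is SHA-256 over the salt plus the listed fields joined with `|`, which is an illustrative guess at the construction; `visitor_hash`, `lookup` and the sample values are all hypothetical.

```python
import hashlib


def visitor_hash(salt: str, ip: str, user_agent: str, site_id: str, day_of_year: int) -> str:
    """Illustrative hash construction (field order and separator are assumptions)."""
    material = "|".join([salt, ip, user_agent, site_id, str(day_of_year)])
    return hashlib.sha256(material.encode("utf-8")).hexdigest()


def lookup(stored: dict, salt: str, ip: str, user_agent: str, site_id: str, day_of_year: int):
    """Hash the query terms the same way and do a single dictionary lookup."""
    return stored.get(visitor_hash(salt, ip, user_agent, site_id, day_of_year))


salt = "example-daily-salt"
# What the store holds: hash -> page viewed.
stored = {visitor_hash(salt, "203.0.113.7", "ExampleBrowser/1.0", "site-1", 203): "/pricing"}

# Whoever holds the salt and knows whom to look for never has to reverse anything.
print(lookup(stored, salt, "203.0.113.7", "ExampleBrowser/1.0", "site-1", 203))  # /pricing
```

Without the salt the same lookup finds nothing, which is why the dispute comes down to who can obtain the salt and for how long it exists.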

Also, I suggested you store the daily hash forever. But even if you really erase it every day, as you say, if you or an attacker makes the same request every day at a predetermined time, then when you/they get your logs, you/they can use that predictable request to get the daily secret too.

I consider the information to be stored in plain text, and that you would have to have requested permission just the same. You pretty much have an identifiable user (via IP/UA/access time) stored in your logs.

Anonymization is removal of information, not encoding it in a convoluted hash.

JackWritesCode · on July 24, 2019

So that needs to be our next target point (access logs). We want to get to a position where we keep no access logs.

And a hacker could indeed "win" if they broke into our system, got the salt and exported the DB. We didn't focus on this in our article, as it's unbelievably unrealistic, but it's still possible. Our next step is to address that.

Without the hash, it's practically impossible to brute force.

gcbw2 · on July 26, 2019

Not talking about a hacker. I am stating that the described hash dance offers no exclusion from GDPR as saying "we promise we won't look" would do.

My point about brute forcing being useless is that you hold all the information needed to re-create the hash. All but one tiny piece, which is the random number. So brute force is a very effective O(<tiny piece size>). And since it is stored in your locally available data, there are no rate constraints.

JackWritesCode · on July 27, 2019

> I am stating that the described hash dance offers no exclusion from GDPR as saying "we promise we won't look" would do.

Under your logic, you would never trust us because we could just add $log->write(UserIp, UserAgent, Hostname, Path) in plain text. Trust is very important and what you do with the data is important under GDPR.

And we don't hold all the information to re-create the hash, that's the thing.

I thought a lot about "Oh but you could just do this, this and this" but, no, that argument doesn't hold. Our obligation under GDPR is what we actually do with data.

kitchenkarma · on July 22, 2019

This is very weak reasoning, because you cannot identify an individual by IP either. This project looks like trying to exploit loopholes. The idea behind GDPR is to make sure companies log only data they need. This project looks into logging the data but without expressing why this is even necessary. Therefore I don't think this is compliant with GDPR.

vonmoltke · on July 22, 2019

> because you cannot identify an individual by IP either

Yes you can, particularly if you correlate across different websites.

kitchenkarma · on July 22, 2019

You are conflating identification of a person by behaviour analysis with matching an ID. What the ID is is irrelevant here; it may as well be a hash. That just proves my point that this project is not compliant.

JackWritesCode · on July 22, 2019

I remember when I learned that IP was considered personal information, I was shocked. But I thought about it and it does make sense.

JackWritesCode · on July 22, 2019

GDPR is for protection of personal data and we store no personal data. Please take a read of this: https://usefathom.com/data/

kitchenkarma · on July 22, 2019

I don't believe you understand what personal data and GDPR are. You are capturing user behaviour, and that is very personal regardless of whether it is "anonymised" or not, and that is without a clear need for doing so. That is pretty much against GDPR.

JackWritesCode · on July 22, 2019

You come across as somewhat hostile but I'm going to assume good intent on your part, so thank you for the challenges on our stance.

So if you take a look at Recital 26 (https://gdpr-info.eu/recitals/no-26/):

> To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.

> To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.

> The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.

> This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.

So the piece about the principles of data protection not applying to personal data rendered anonymous is crucial. We believe that GDPR does not apply to us because of that. But even if GDPR did apply to us (we'll assume it does, that's always the best way to be), then our legal basis is that there's legitimate interest. As a website owner, it is in your legitimate business interest to understand how your website is performing - e.g. the most popular pages, the pages where people linger for longer, the pages where people bounce.

number6 · on July 22, 2019

Article 4 (1) states:

‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

- a hash number falls into this. You cannot just quote recital 26 and stop reading since you found it fitting. Recital 30 covers the case for "other identifiers" that might replace cookies. No hard feelings we all do.

The data might be anonymous for a third party, but if you can single out just one person, in other words one unique visitor, it is not anonymous. NB. One IP poisons the whole data.

So your fallback is Article 6(1)(f), which is reasonable, but you cannot assume the interest of a site owner is always higher than the interest of the visitors. You have to put your arguments into writing and give people the means to appeal it. 6(1)(f) is not meant as a blank cheque or batch job...

JackWritesCode · on July 22, 2019

> Online identifiers for profiling and identification

> Natural persons may be associated with online identifiers provided by their devices, applications, tools and protocols, such as internet protocol addresses, cookie identifiers or other identifiers such as radio frequency identification tags.

> This may leave traces which, in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them.

The thing is, you can't create profiles. So right now I could give you a single entry for our website

> NULL "" "https://www.usefathom.com" "bb9377f4cf33093765835a48e962a5dbd3168499abd12b120c8c118c86c41479"

How could we possibly use that to profile / identify? The hash (bb9377f4cf33093765835a48e962a5dbd3168499abd12b120c8c118c86c41479) is unique in the database table and never repeats.

I hear you. We don't rely on Recital 26 to comply with GDPR. I've not had the Recital 26 piece confirmed by a lawyer, but it's a personal hunch / exploration. Your comments on Recital 30 are helpful, thank you. I'd like to hear your thoughts on my reply if possible :)

number6 · on July 23, 2019

If you don't need the identifier, why not leave it out altogether (Art. 5(1)(c))? Or is it unique only in this case?

JackWritesCode · on July 23, 2019

Identifier is used for previous view only. Previous view is updated when a new view is inserted into the temp table & previous view's user identifier is wiped.
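The wipe-on-next-view behaviour described here can be modeled with a short Python sketch. `TempViews`, its in-memory list and the index dictionary are hypothetical stand-ins for the temp table; the real schema and any other fields updated on the previous view are not shown in this thread.

```python
class TempViews:
    """Toy model: only a visitor's most recent page view keeps the identifier."""

    def __init__(self):
        self.rows = []       # each row: {"visitor": hash or None, "path": str}
        self._latest = {}    # visitor hash -> index of that visitor's latest row

    def record(self, visitor_hash: str, path: str) -> None:
        previous = self._latest.pop(visitor_hash, None)
        if previous is not None:
            # A new view arrived: wipe the identifier on the previous view.
            self.rows[previous]["visitor"] = None
        self.rows.append({"visitor": visitor_hash, "path": path})
        self._latest[visitor_hash] = len(self.rows) - 1


table = TempViews()
for path in ["/", "/pricing", "/signup"]:
    table.record("hash-1234", path)
print([row["visitor"] for row in table.rows])  # [None, None, 'hash-1234']
```

In this model at most one stored row per visitor carries the hash at any moment, so earlier rows can no longer be joined into a browsing path after the fact.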

Matticus_Rex · on July 22, 2019

They don't have any PII and are therefore not subject to GDPR. They have data that, if it were not anonymized, would be PII, but it's anonymized and therefore isn't.

kitchenkarma · on July 22, 2019

It doesn't matter if it is an IP or another identifier, e.g. a hash. A person can be identified by behaviour, and this is not anonymised.

JackWritesCode · on July 22, 2019

How can a person be identified from a hash by behaviour? We built the software but you seem to know something we don't...