How we built a GDPR-compliant website analytics platform without using cookies (usefathom.com)
207 points by pauljarvis 60 days ago | 120 comments

We are incredibly open to any ideas, comments or concerns on how we're doing this. This is a big step up from what we had previously, but there’s always room for improvement. Happy to hear thoughts in the comments.

Hi Paul, thanks for being open about this. I have a big, important question.

ICO, the agency in charge of enforcing GDPR and related legislation in the UK, released guidance earlier this month on the topic of cookies. One of the most notable parts of this guidance is that "device fingerprinting" is treated the same as a cookie[1]. And also that website analytics requires consent to use cookies or similar technologies[2] ("similar technologies" including device fingerprinting).

Now, the above guidance is related to PECR rather than GDPR, which is what your post is about. But, given the above, do you think that your software is compliant/exempt from PECR or do you think that organizations will still have to take extra steps to be compliant with privacy legislation?

[1] https://ico.org.uk/for-organisations/guide-to-pecr/guidance-...

[2] https://ico.org.uk/for-organisations/guide-to-pecr/guidance-...

Totally, so we feel we follow the spirit of the PECR law, but until there's a case against it, we don't have precedent. But we feel like if analytics was under fire we'd be at the bottom of the list because we've gone out of our way to follow the spirit of it.

We don't consider ourselves to be building any sort of 'server side cookie', especially since an anonymous hash is only ever tied to one piece of data and is actually set to null as soon as another request comes in. Unlike cookies, data doesn't follow the user around as they browse the site. A cookie would stick with you as you browse a website.

We've spoken with a few lawyers about this and there's too much grey area at the moment. Time will tell and we're hoping that the UK (my home country) sort PECR out.

I think that's a fair question.

AFAICT, v1 of PECR awkwardly applies whenever the cookie is not functionally directly necessary for the service that the user is using. PECR applies even if, like here, the cookie is just for counting unique numbers of visitors, and is not used for fingerprinting individuals.

The draft v2 of PECR contains an exemption for first party analytics. I think this maybe strikes a nice balance: explicit consent would still be required for the more-harmful third party analytics.

Not sure when v2 of PECR will happen. It is years overdue. Perhaps it is a priority for the newly elected European Parliament and the new Commission?

What makes analytics first-party? When the first party serves them, or only if the data never leaves machines under the direct and exclusive control of the first party?

I don't know the answer to that.

FWIW, my guess would be that the definition is fairly strict.

Now, I don't think that would prevent the first party using data processors, but I suspect the first party would have to exercise a lot of control over the processor. This would be in contrast to a service like Google Analytics, where the company's control and choice is simply limited to take it or leave it.

A data processor agreement is not usually negotiated all that hard and this is indeed not really possible with truly large companies (and let's face it, that's where most companies get their enterprise software). Therefore I feel that expecting a lot of control to be exercised is a bit of a pipe dream.

I trust that there's enough questions and scrutiny about your anonymization, so I don't have any questions about that. Mine is more about implementation.

If I want to integrate this with a single page app (like Ember or React), are there enough API hooks on how I can track click events and page load events, etc in the JS? We threw together Google Analytics for our launch just so we would have SOME data, but we want to move away from it ASAP for privacy reasons.

Any more on how it would compare to something like Piwik (the product we're looking at)?

We <3 EmberJS. Fathom's Dashboard is built using it. You'd want to build it into the Router and call fathom('trackPageview') to log it :)
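
A rough sketch of what that wiring might look like outside of any particular router. It assumes the Fathom snippet has defined the global `fathom` function mentioned above; `trackRouteChanges` and its parameters are hypothetical names, and an Ember app would more likely hook its router than patch `history.pushState` directly.

```javascript
// Hypothetical helper: report a pageview on every client-side navigation.
// `win` is the browser window (injected so the function can be tested);
// `track` is whatever reports the pageview, e.g. () => fathom("trackPageview").
function trackRouteChanges(win, track) {
  const originalPushState = win.history.pushState;

  // Single-page apps usually navigate via history.pushState, which fires no
  // event, so wrap it and report once the URL has changed.
  win.history.pushState = function (...args) {
    const result = originalPushState.apply(this, args);
    track();
    return result;
  };

  // Back/forward navigation fires "popstate" instead.
  win.addEventListener("popstate", () => track());
}

// In a browser, assuming the tracker has defined a global `fathom`:
// trackRouteChanges(window, () => fathom("trackPageview"));
```

The initial page load would still need its own `fathom('trackPageview')` call, since neither hook fires for it.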

How do you guard against hash collision?

Based on the blog - anyone who shares an IP address (such as inside a company network) would effectively look the same.

And given most companies run managed browsers on managed systems, the user agent is going to collide too

Not just inside a company network. CGNAT is pretty commonplace nowadays.
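
To make the collision concrete, here is a minimal sketch. The signature layout (salt, IP, user agent and hostname joined with `|`) is an assumption for illustration, as are all the values; the real inputs and their order may differ.

```javascript
const crypto = require("node:crypto");

// Hypothetical signature: SHA-256 over salt, IP, user agent and hostname.
function visitorSignature(salt, ip, userAgent, hostname) {
  return crypto
    .createHash("sha256")
    .update([salt, ip, userAgent, hostname].join("|"))
    .digest("hex");
}

const salt = "example-daily-salt"; // placeholder value
const ua = "Mozilla/5.0 (managed corporate build)"; // placeholder value
const natIp = "203.0.113.7"; // documentation-range address shared by both users

// Two different people behind the same NAT with the same managed browser.
const alice = visitorSignature(salt, natIp, ua, "example.com");
const bob = visitorSignature(salt, natIp, ua, "example.com");

console.log(alice === bob); // true: two people, one signature
```

Nothing in the inputs distinguishes the two visitors, so no choice of hash function can separate them.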

Indeed. This is one of the limitations to our approach. We're thinking of ways to work around it. Please shout if you have any ideas ;)

If/When IPv6 is widely used the problem should fade away.

Why all the trouble with hashes - can't you just do it on the client and not have to store any data at all?

"For tracking unique page views"

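  if (!sessionStorage[location.href]) {
    navigator.sendBeacon("/unique-pagehit?" + encodeURIComponent(location.href));
    sessionStorage[location.href] = "1"; // assumed: mark this URL as counted for the session
  }

"For tracking unique site views"

  if (!sessionStorage["Hi!"]) { /* remainder of the snippet is missing from the source */ }

"For tracking previous requests"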

I'm not sure I fully understand what is being measured (is it session-only?). For the duration someone watched a page, you can use sendBeacon in onBeforeUnload. To detect a bounce, set a Math.random() in a session variable, send it at the start of the page, and have every page load send the previously stored random variable. Then count the unique random keys you received on the server - those are the bounces.

I know, in practice you'll need to trim sessionStorage, sanitize URLs, use something less-colliding than Math.random, deal with new tabs, add some polyfills and other robustness, etc... but I don't yet see why the tracking mentioned needs any user ids or hashing at all.
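
The server side of the bounce idea above could look roughly like this. It follows one reading of the comment: the page generates a random key once per session, sends it with every page load, and the server treats a key seen exactly once as a bounce. `countBounces` is a hypothetical name.

```javascript
// Count bounces from the keys received in beacons. A key that shows up exactly
// once belongs to a session with a single page load.
function countBounces(receivedKeys) {
  const hitsPerKey = new Map();
  for (const key of receivedKeys) {
    hitsPerKey.set(key, (hitsPerKey.get(key) || 0) + 1);
  }
  let bounces = 0;
  for (const hits of hitsPerKey.values()) {
    if (hits === 1) bounces += 1;
  }
  return bounces;
}

// Three sessions: "a" viewed two pages, "b" and "c" viewed one page each.
console.log(countBounces(["a", "b", "a", "c"])); // 2
```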

Great idea but we can't use sessionStorage under PECR, which is why we made this move. Plus we got rid of anything being stored on the user's machine.

This client-based scheme would be totally vulnerable to spamming / general messing about. Doing it in the server requires an actual request, and should still allow them to filter out attempts at gaming the system.

As a user of the open source version of Fathom, I'm a little concerned by the lag in publishing this update to the community edition. I assumed development work was happening in the open on Github, but I guess that's not the case?

Whole new language & codebase. Old developer left, we don't write Go.

So it's not as easy as just pushing the update to the repo. We are still committed to open-source, but we also have a business to run and need to make a living here (we're two dudes who care about privacy, not a huge company with deep pockets) :)

The community version is getting a full update soon. We just have to focus on profit a bit (this keeps us in business and able to update the repo).

Speaking as someone who built an analytics business while still maintaining an open source version of the code, I can confirm that it's hard. Sometimes it starts off as one small patch that only makes sense for the hosted service, but you can't push to the OS branch until the code is refactored to protect secrets.

I cannot give you any tips on managing your time between the two, but you may want to consider raising your prices. Back in 2011, we ran our basic account at $50/mo, Business at $150/mo and Enterprise at $CallMe/mo. We could probably have upped that after a year with all the new features we'd put in, but we were acquired before the 2y mark and dropped the first 2 plans.

We still maintain the open source version of the code, but neither of the original founders work on it (for me it's gone back to being a hobby because there are now people paid full time to work on it). We still get questions about the lag in updates. We typically do quarterly bulk pushes to the open source version now.

Thanks for sharing your story. It is hard and we have to fight to not resent OSS because of 0.001% of the OS community. Open-source software has contributed significantly to our lives, and we love it, so we're going to be pursuing OS Version 2 regardless of a few angry people.

Armchair CEO here but charge a fair rate and if that means double do it. People will respect it and pay it if the product or service is helping them make money and I’d rather pay a higher amount and know I’m helping fund a tool that I use to help me be productive than get something cheap.

There's certainly nothing wrong with that. But this sounds a little defensive. And if you're going to promote the "Community Edition"/open-source nature of the codebase, this should probably be made more clear to your users/contributors. (Maybe it is made clear and I'm just not seeing it, though - this is my first time hearing about this product).

I think the defensiveness comes when people put comments such as "I'm a little concerned by the lag in publishing this update". I've never spoken like this in my life to an OS contributor and I use open-source software every day. We're working hard to get the software OS but we were originally planning on keeping the codebases separate (since they were different languages). We only recently pivoted on this after speaking to some wonderful people in the OSS community. Perhaps our communication has been poor, so we'll work on that, but we are working hard at this.

I'm a big fan of fathom and have been using it for my personal sites for about six months now- I've completely ditched Google Analytics and other providers.

When I heard that the new version wasn't going to be open source it was disappointing- both for the simple fact that I value open source, and also for how it wasn't really announced so much as heavily implied until I actually asked. Knowing that you've listened to the community and are going to open source the new version as well is a huge relief for me, and will definitely result in me promoting your product again.

I would say from a communication standpoint things could have been clearer, and I'm sure you'll work on that. From a timeline standpoint I think it would be nice if the open source and hosted versions eventually were released together, but with a full rewrite I also understand that you probably want to clean it up and make sure it's in good shape before doing so. That being said if you want some help with alpha testing the open source version I'd be happy to assist, and I'm sure others would as well.

You are the side of open-source that the world needs. We were going to keep the current repository as-is because it's written in Go, but we'll introduce a new one. Upon announcing that V2 was paid only, people were angry (understandably). We spoke with the community and the consensus was that everyone would prefer that we 'archive' V1 (Golang) and then open-source V2 moving forward. Sorry for the bad communication, it's been challenging with the move between languages!

That makes a ton of sense- you'll be able to get a lot more community support from the PHP community than the Golang one, just due to the nature of its size (and the fact that PHP is still very much a web language, which targets your audience well). It's been a few years since I was really involved in that community, but I still maintain a few open source projects and keep up with the language- who knows, maybe I'll be able to throw some bug fixes your way.

You know, I do wonder why people talk like that these days. I cannot recall a time when it was acceptable and I certainly wouldn't do it myself. But these snide comments are so common on OSS these days.

Speaking as a paying customer of Fathom [1], thank you for your hard work. I know how hard it is to spin up a new business. I for one don't care whether your code is open source or not... I'm paying for the service and the code has no value to me. Please spend your limited time and money on building your business. I don't want the distraction of having to find another analytics platform. ;-)

[1] https://www.agilefluency.org

Out of curiosity, which language did you choose to write the new code base?

We are writing Version 2 in PHP 7.3 with Laravel. We're pleased with its performance (our hosted solution operates at large scale using Laravel) and it's our favourite language to work with. We'll be building the frontend in EmberJS and it will be incredibly fast.

Awesome that you're using Ember! I'm hoping that (from my earlier question) I'll easily be able to keep track of page load/unload events using route transitions if I'm using this product.

This is a common solution to the problem of PII, but without any information on returning users I would argue that its value as an analytics platform is limited. Few are the tools where you can grow the business without knowing the difference between a first-time and return user, which is the reason cookies were invented in the first place.

However, since such businesses already need to collect personal info as part of your account creation it shouldn't be hard to build analytics on top of that existing PII. If they are already collecting PII it doesn't seem to save much to have their analytics tool avoid it?

Fathom Analytics is intentionally limited, and the limitations you point out are 100% intentional. There are many businesses who can't use our product, but millions that can :)

Most schemes of this kind are just more complicated cookies that people hope will avoid the GDPR provisions by dint of being obfuscated.

What the article is discussing looks (at first brush) to be a sensible way of aggregating users up-front before it hits the database, rather than later. So no personal data is stored.

Does this meet the requirements for a site to avoid notifying users under the GDPR? I have no idea.

Even with the best of intentions, if you use a service like this then you are relying on them a) doing what they claim, and b) not screwing up (by leaving logs around, etc).

If I use this service and data from my users gets leaked by Fathom, who gets blamed? The users were on my site, so I guess it is I that gets fined. Maybe the risk is worth it, maybe it isn't.

In response to your final question, this[1] document from the UK's ICO has some interesting info. Essentially you're either a Data Controller (that would be your site in this example) or a Data Processor (Fathom, in this case -- probably?!).

"64. The ICO cannot even take action directly against a processor who is entirely responsible for a data breach, for example by failing to deliver the security standards the controller has required it to put into place. However, in these cases the ICO may decide not to take any enforcement action against the controller if it believes it has done all it can to protect the personal data it is responsible for and to ensure the reliability of its processor, for example through a written contract. However, whilst the ICO cannot take action against the processor, the data controller could take its own civil action against its data processor, for example for breach of contract."

Though it goes on to say that in some circumstances, the processor can _become_ a controller, in which case the ICO can go after it.

[1]: https://ico.org.uk/media/for-organisations/documents/1546/da...

And even if there was a data breach where myself & Paul were held at gunpoint and told to export the database, there's no personal data to do anything with. Not even in our Redis queue! Our database is very boring, anonymous and simple, and we like it that way.

The GDPR (and other data privacy legislation) uses the concept of a "data controller" and a "data processor". Data controllers use a variety of data processors to deliver a service to their users.

If you ran a SAAS and used Fathom analytics, a SMTP email provider (to send password resets), a newsletter provider (for your monthly newsletter), a blog host etc. each of those would be data processors as you (the SAAS) are making the determination of where user data is going on the backend.

As a data controller, it's your responsibility to make sure that each of those services you are using is handling the data in an appropriate and safe way.

As to who gets fined if there is a data breach: the answer is likely nobody. I say this because it's not like you have a credit card on file and breach automatically means a fine. What really matters is what actions you're taking before and after a breach.

- As a data controller did you notify your users with no undue delay?

- As a data processor did you notify the SAAS with no undue delay?

- Did you identify the source of the breach? Did you take steps to remedy it?

For almost all these privacy regulations if you:

1. Take steps to protect user data/privacy like https, encryption etc.

2. Provide a mechanism to allow users to make data requests for their own identifying data

3. Notify users if there's a data breach

You would be in compliance.

> Does this meet the requirements for a site to avoid notifying users under the GDPR? I have no idea.

Not necessarily. It's a bit complex, but the fact that Fathom ingests personal data at all likely means that they must still be disclosed by whoever is using Fathom's code.

On the other hand, if Fathom were able to push data ingestion into the first party's infrastructure so that only aggregated data hits Fathom's own infrastructure, I'm pretty sure that would put Fathom in the clear due to the fact that they are no longer processing or storing personal data.

To Fathom's credit, I am a big fan regardless of the steps they seem to be taking with anonymization, and I will consider using them instead of GA for future projects.

What do you think about Recital 26? For GDPR, our stance can be found here: https://usefathom.com/data/

I am neither a lawyer nor a DPO, I just worked on GDPR compliance in the past, so take what I say with a grain of salt.

I went and reread parts of GDPR, and I think you are right, though not because of Recital 26. I think it's pretty clear that after your ingestion the data has been in good faith anonymized to the degree that it is no longer personal, and therefore your analytics code should be exempt from consent rules.

The interesting question to me is whether a controller deciding to put your pixel on a page for analytics purposes counts as you processing on the controller's behalf to an extent requiring consent. I don't see any clause specifically regarding third party access to personal data (as opposed to third parties processing personal data), so I agree with your stance that it's most likely fine.

(obligatory IANAL) Recitals aren't legally binding in and of themselves, so I'd be wary of _relying_ on them _completely_.

That said, the articles of a regulation should be read in light of the recitals (as the latter are the rationale behind the actual binding legislation).

Trust me, we're not relying on Recital 26, that is just my thought experiment :) See here for our GDPR compliance: https://usefathom.com/data/

If visits expire after 30 min, why not rotate the salt every 30 min? Keep current and previous salt, update as needed.
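
A minimal sketch of that rotation, assuming random 32-byte salts held in memory. `SaltRotator`, `tick` and `candidates` are hypothetical names, and the way the salt is mixed into the hash is only illustrative.

```javascript
const crypto = require("node:crypto");

const ROTATE_MS = 30 * 60 * 1000; // 30 minutes, matching the visit expiry

// Keeps only the current and the previous salt.
class SaltRotator {
  constructor(now = Date.now()) {
    this.current = crypto.randomBytes(32);
    this.previous = null;
    this.rotatedAt = now;
  }

  // Rotate once the current salt is at least ROTATE_MS old.
  tick(now = Date.now()) {
    if (now - this.rotatedAt >= ROTATE_MS) {
      this.previous = this.current;
      this.current = crypto.randomBytes(32);
      this.rotatedAt = now;
    }
  }

  // Signatures a returning visitor could have been stored under:
  // the current salt first, then the previous one if it exists.
  candidates(visitorData) {
    const salts = this.previous ? [this.current, this.previous] : [this.current];
    return salts.map((salt) =>
      crypto.createHash("sha256").update(salt).update(visitorData).digest("hex")
    );
  }
}
```

A lookup would try each candidate in turn, so a visit that started just before a rotation can still be matched until the older salt is dropped at the next rotation.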

I would have more faith in privacy, if you didn't store the salt in the DB or permanent storage. If you manage to statically load-balance the users (e.g. hash site, ip, user-agent, don't forget site), the hash could be in-memory only. Sessions would break on server restart, but that's more of a feature.

To move things further, you might not even need to store the hashes in the DB. Keep them in server memory only and (real-time) update aggregate data in DB.
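
As a sketch of that last suggestion: signatures live in a process-local map and only counters would ever be persisted. `InMemoryVisits` is a hypothetical name, the `totals` object stands in for aggregate rows in a database, and the 30-minute window is taken from the comment above.

```javascript
const VISIT_TTL_MS = 30 * 60 * 1000;

class InMemoryVisits {
  constructor() {
    this.expiry = new Map(); // signature -> time at which the visit expires
    this.totals = { pageviews: 0, visits: 0 }; // stand-in for aggregate DB rows
  }

  record(signature, now = Date.now()) {
    const expiresAt = this.expiry.get(signature);
    const isNewVisit = expiresAt === undefined || expiresAt <= now;
    if (isNewVisit) {
      // Expiry is fixed when the visit starts; later pageviews do not extend it.
      this.expiry.set(signature, now + VISIT_TTL_MS);
      this.totals.visits += 1;
    }
    this.totals.pageviews += 1;
  }

  // Drop expired signatures so nothing lingers in memory.
  sweep(now = Date.now()) {
    for (const [signature, expiresAt] of this.expiry) {
      if (expiresAt <= now) this.expiry.delete(signature);
    }
  }
}
```

This only works if every request from a given visitor reaches the same process, which is the static load-balancing requirement the comment mentions.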

The visit expires 30 minutes after the visitor lands. The expiration isn't generic.

It's an interesting idea. We have multiple servers under the load balancers, so we'd be able to store them in Redis, but that is no better than permanent storage, since Redis could still be breached and you'd see it with ease.

Hi Jack and Paul - love what you're doing! This solution is so needed.

I wondered whether you could explain what makes your hashing different from the hashing used by Facebook for their custom audiences tool which was deemed unsuitable for anonymisation as per https://www.spiritlegal.com/en/news/details/e-commerce-retai...

Couldn’t people just parse the log files from their webservers?

Definitely. Helps with people who disable JS as well. I've used this before https://goaccess.io
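
For a sense of what log parsing involves, here is a small sketch that counts successful GET requests and distinct client addresses from lines in Common Log Format. The sample lines are made up, `summarizeAccessLog` is a hypothetical name, and a tool like GoAccess does far more than this.

```javascript
function summarizeAccessLog(lines) {
  // host ident user [time] "METHOD path protocol" status bytes
  const pattern = /^(\S+) \S+ \S+ \[[^\]]+\] "(\S+) (\S+) [^"]*" (\d{3}) \S+/;
  const hosts = new Set();
  let pageviews = 0;
  for (const line of lines) {
    const match = pattern.exec(line);
    if (!match) continue; // skip lines that are not in the expected format
    const [, host, method, , status] = match;
    if (method === "GET" && status === "200") {
      pageviews += 1;
      hosts.add(host);
    }
  }
  return { pageviews, uniqueHosts: hosts.size };
}

const sample = [
  '203.0.113.7 - - [10/Jul/2019:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
  '203.0.113.7 - - [10/Jul/2019:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 734',
  '198.51.100.2 - - [10/Jul/2019:10:01:00 +0000] "GET / HTTP/1.1" 200 512',
  '198.51.100.2 - - [10/Jul/2019:10:01:02 +0000] "GET /missing HTTP/1.1" 404 128',
];
console.log(summarizeAccessLog(sample)); // { pageviews: 3, uniqueHosts: 2 }
```

Note that raw access logs contain IP addresses, so this approach keeps exactly the kind of data the hashing scheme is trying to avoid storing.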

bashy from laracasts? Hey there! And I've heard the disabled JS piece before. How many people are disabling javascript?

Yes that's me. Hi there.

I use fathom on one of my sites and I like it. I was just replying to someone who said people could just use access logs instead. Some don't like adding JS to their site I guess.

Ad blockers will also block requests to the tracker.js if you use analytics.domain for example.

Looking at their live demo (https://stats.usefathom.com/#!p=1w&g=hour), I can see a lot of traffic is coming from ycombinator. So...

(I mean, I don't have a point here but I find it pretty interesting. xD)

Duck Duck Go traffic is 10% of the Google traffic.

Is that typical for privacy-focused tech sites? I would have expected a lower percentage.

(DDG fan here, but people look at me funny when they see me using it...)

That's exactly what they are selling - the ability to get relevant business metrics while preserving user privacy.

sorry but, isn't that a (unnecessarily complex) cookie?

Look, entire industries exist for being compliant with the letter but not the spirit of the law so I'm sure that this in no way meets the definition of a cookie as far as the GDPR is concerned.

However, this is absolutely a cookie. Scraping just enough information from the browser to create a unique but stable hash and then having the browser compute it every request isn't at all different from that browser information acting as the cookie.

Technically no - cookies are stored by the user. If anything it is technically worse in some ways because it moves it to the back end outside of user control.

We don't store a cookie but, yes, the hash doesn't need the date item in it since it has the salt. We'll change that in the next release

The data is not anonymous. Anonymity is actually very hard to claim (read k-anonymity, differential privacy, etc).

Nevertheless, the chances of identifying someone are probably pretty low, and it's a good effort to make analytics more privacy friendly.

So it's practically anonymous. Nobody has enough computing power to brute it & the data is deleted, typically, in 30 minutes.

To de-anonymize the data you don't actually need to brute force the 256-bit hash. If the other pieces of data are known (salt, site, page, day of year) and you can make a shrewd guess at the user agent then you'd only need to brute force the 32-bit IP address(*).

(*) Assuming IPv4. Obviously IPv6 addresses would be much harder to brute force, but still easier than a 256-bit hash.

I do take your point but when we get into this area, it becomes a big question of trust. Because if I gave you a hash, you would need to guess the 256 bit salt in addition to all the other possibilities.

I mean, hey, here's a hash: 26246226167b9f190d3a1ce726efe07ae18bbf0480a78d19390b9aaf13f25cb0

Imagine you just got hold of it through a data breach. I'll give you $1,000 if you can get it de-hashed before midnight Chicago time ;)

It's true that you'd need to know the salt first, but the server does know the salt, and if you obtained the hash through a data breach then you probably obtained the current day's salt as well.

So if someone had unlimited access to our servers, that would be a problem. One piece to note is that page views get deleted after around 30 minutes.

So one of the reasons we posted to HN was for conversations like this. Reading what you put makes me think we need to do more when creating the PageRequestSignature and the SiteRequestSignature, because if someone had access to our server, got the hash, stopped all cron jobs from processing data, then they could work it out after a huge amount of time / computing power. But to be honest, at this point, they could also add log($_SERVER) and get the entire request body of a user, so we would have much bigger problems in that scenario.

Anyway, you've given me a few new thoughts on hardening from data breaches. Because, yes, if they know the hash & have full access to our server then it becomes easier. So we almost need to move it to the point where they won't know the salt that was used for a particular user. So we'd not recycle a single salt, we'd recycle multiple salts that are based on, perhaps, the first 2-3 digits of a user's IP address combined with the last 2-3 digits.... Then it would be much harder to break without first knowing the $_SERVER dump (the user's IP etc.). Obviously this wouldn't stop a "complete control of server" attack where they could just start logging everything but it would really ruin a brute forcer's day because they wouldn't know what salt to start with.

What do you think of that idea? I'm running on little sleep so be nice ;) Also thanks so much for all your feedback so far, it's so appreciated!

The case I would be (slightly) concerned about would be where an attacker has obtained limited, read-only access to your database, so they have the current salt and the hashes related to the last 30 minutes of activity but not the ability to simply log whatever data they want. At that point brute-forcing the 32-bit address space and a small set of common user agent strings would allow them to determine the IP addresses for specific page views.

Of course, if an attacker has access to the salt and is interested in the pages viewed by a specific known IP address then this all becomes much simpler. Then the only unknown is the user agent, which is relatively low-entropy.
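
A scaled-down sketch of the attack described above, assuming the attacker holds the salt and a short list of likely user agents. The signature layout and all values are placeholders; only a single /16 block is searched here so the example runs quickly, and a full IPv4 sweep would be the same loop over all four octets.

```javascript
const crypto = require("node:crypto");

// Hypothetical signature layout; the real field order and separators are not known here.
function signature(salt, ip, userAgent) {
  return crypto.createHash("sha256").update(`${salt}|${ip}|${userAgent}`).digest("hex");
}

// Try every address in prefix.0.0 through prefix.255.255 with each user agent.
function recoverIp(targetHash, salt, prefix, userAgents) {
  for (let c = 0; c < 256; c++) {
    for (let d = 0; d < 256; d++) {
      const ip = `${prefix}.${c}.${d}`;
      for (const ua of userAgents) {
        if (signature(salt, ip, ua) === targetHash) return { ip, userAgent: ua };
      }
    }
  }
  return null;
}

const leakedSalt = "leaked-salt"; // placeholder
const commonAgents = ["UA-one", "UA-two"]; // placeholder strings
const leakedHash = signature(leakedSalt, "198.51.100.23", "UA-two");

console.log(recoverIp(leakedHash, leakedSalt, "198.51", commonAgents));
```

The search recovers the address and user agent that produced `leakedHash`. With the wrong salt the same search returns `null`, which is why keeping the salt secret carries the whole scheme.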

> So we'd not recycle a single salt, we'd recycle multiple salts that are based on, perhaps, the first 2-3 digits of a users IP address combined with the last 2-3 digits....

If an attacker is already brute forcing the IP address and has the ability to obtain the salt then they would just use the correct salt for each IP address. Making the salt depend on other data already included in the hash doesn't change the size of the search space.
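
The point can be shown with a small sketch. The salt-selection rule here (first octet modulo the number of salts) is a made-up stand-in for the scheme floated in the thread, and all names are hypothetical.

```javascript
const crypto = require("node:crypto");

// Hypothetical scheme: choose the salt from the IP's first octet.
function saltFor(ip, salts) {
  const firstOctet = Number(ip.split(".")[0]);
  return salts[firstOctet % salts.length];
}

function signature(ip, userAgent, salts) {
  const salt = saltFor(ip, salts);
  return crypto.createHash("sha256").update(`${salt}|${ip}|${userAgent}`).digest("hex");
}

// The attacker guesses an IP, derives the salt that guess implies, and hashes once.
// That is one hash per candidate, exactly as with a single shared salt.
function countGuessesUntilMatch(targetHash, candidates, userAgent, salts) {
  let guesses = 0;
  for (const ip of candidates) {
    guesses += 1;
    if (signature(ip, userAgent, salts) === targetHash) return guesses;
  }
  return -1;
}
```

Running `countGuessesUntilMatch` with one salt or with several gives the same guess count for the same candidate list, because the salt is a function of data the attacker is already enumerating.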

I think what’s been helpful for us with posting this here is to hear of different ideas for how someone might hack Fathom. When we came up with it, our starting point wasn’t “you have the salt and IP address”, go break a hash. It was on the assumption that you don’t have the salt or IP. I think we can improve what we’ve built. 720 salts improves resilience in a few areas but not in the scenario you are painting here. The scenario you’re painting here has made me think of additional ideas though.

If they had the salt, IP and user agent, they’d have to also brute force every possible hostname and pathname, which would be insane. But I suppose they’d only have to do a few million based on the data we have on hostnames / pathnames...

Lots of ideas for improvement are popping into my head and I love how this community keeps challenging you to improve things. We had feedback on Reddit but it was much angrier!

The next step is to take your feedback and look into how we would defend against the scenario you’ve provided. Thank you!

Maybe I'm missing something but (1) I don't think this is GDPR compliant, and (2) why so complicated?

Regarding (1),

> Brute forcing a 256 bit hash would cost 10^44 times the Gross World Product (GWP). [...]

> We have rendered the data anonymous to the point where we could not identify a natural person from the hash.

> It's possible that GDPR does not apply to Fathom since data is made completely anonymous. Even if GDPR did still apply, we reiterate the stance that there is legitimate business interest to understand how your website is performing.

This seems to imply a profound confusion between hashing and anonymity. Just because it's hashed doesn't mean it's anonymous! You don't need to "brute-force" the hash, you just need to find a user that matches your hash... which is 1 in 7 billion (or so), much more tractable. This is also the principle e.g. MD5 rainbow tables are based on...

They claim to change the hash every 24 hours, so it's equivalent to having a session cookie with 24-hour expiration (session cookies are "anonymous" by their definition, they don't have any user information and they're impossible to "brute force", they "just" enable tracking). I've no idea if 24-hour session cookies are GDPR-compliant...

Regarding (2), given that this seems (again, I might be misunderstanding) equivalent to a 24-hour session cookie, why not just do that? However, then you're ... drumroll ... giving control to the user. Why not just give control to the user, period?! For example, by storing the list of pages visited in Local Storage, and only pinging the server once for each page(view) every 24 hours?
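
A sketch of the Local Storage variant proposed above, under the assumption that one timestamp per URL is kept client-side. `shouldPing`, the `lastPing:` key prefix and the `/hit` endpoint are hypothetical; the storage object is passed in so the logic can run outside a browser.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Decide whether this page should ping the server, remembering the last ping
// per URL in a Storage-like object (window.localStorage in a browser).
function shouldPing(storage, url, now = Date.now()) {
  const key = `lastPing:${url}`;
  const last = Number(storage.getItem(key)); // null becomes 0, i.e. never pinged
  if (last && now - last < DAY_MS) return false;
  storage.setItem(key, String(now));
  return true;
}

// In a browser:
// if (shouldPing(localStorage, location.pathname)) navigator.sendBeacon("/hit");
```

The server never sees an identifier in this design; the trade-off is that the count depends entirely on state the user controls.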

For GDPR compliance notes please see: https://usefathom.com/data/

> You don't need to "brute-force" the hash, you just need to find a user that matches your hash... which is 1 in 7 billion (or so), much more tractable. This is also the principle e.g. MD5 rainbow tables are based on...

Not quite. We use a SHA256 hash as our salt, and that changes each day, so you'd need to brute force that.

In terms of how many possible combinations there are for this salt, please see: https://stackoverflow.com/a/49520766 - you would need to brute force it and try each possible combination with every single possible IP / User Agent / Site combination to break a hash. This is why it's not theoretically impossible but it's practically impossible.

We would love to approach things in an easier way but PECR doesn't want cookies, even anonymous ones.

Now, one thing that we have uncovered thanks to someone on here is that we need to increase our resistance to data breaches. If someone had complete, unlimited access to all our data / servers, including the daily salt, then they could de-hash page views from the last 30 minutes. I have no idea how long that would take. There are 4,294,967,296 (?) possible IP addresses, and then over 3M (?) user agents, so it'd be an absurd, pointless exercise.... Anyway, we're going to be bringing in multiple salts that depend on the user IP address, meaning that, in the event of a data breach, a hacker won't know which salt has been used for a hash :) Perhaps we base the salts on the first 3 digits of an IP address? That would mean we'd have 720 possible SHA256 salts!
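
For a rough sense of scale, the figures quoted above can be multiplied out. The hash rate below is an assumed round number for illustration, not a measurement of any real hardware.

```javascript
// Size of the search space once the salt is known, using the figures above.
const ipv4Addresses = 2 ** 32; // 4,294,967,296
const userAgents = 3_000_000; // "over 3M", as estimated in the comment
const candidates = ipv4Addresses * userAgents;

// Assumed attacker speed; purely illustrative.
const hashesPerSecond = 1e9;
const days = candidates / hashesPerSecond / 86_400;

console.log(candidates.toExponential(2)); // 1.29e+16
console.log(days.toFixed(0)); // 149
```

That is large but finite, and it shrinks sharply if the attacker only tries a handful of common user agents or already knows the IP address of interest, as other commenters in the thread point out.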

You can get an order of magnitude on hash collision resistance by rolling every two hours. Maintain “two” backend databases to gracefully track sessions between roll overs.

Also, for non GDPR IP blocks, maybe just store a per client salt in a cookie(!) and then xor it with the rotating server salt.

I’m not worried about hash collision with sha256. Any reason why I should be? :)

And we can’t use cookies because of PECR!

> Tracking page views alone, without visits, is completely useless and means that you won’t have insight into how many people visit your site / pages each day.

What's the difference between a page view and a visit?

If I came onto your website and refreshed one of your pages 500 times, that would count as 500 page views but only 1 visit.
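
In code, the distinction comes down to counting rows versus counting distinct visitor signatures. A minimal sketch, with a hypothetical `summarize` helper and a made-up signature value:

```javascript
// Each hit carries the anonymous visitor signature and the path that was viewed.
function summarize(hits) {
  const visitors = new Set(hits.map((hit) => hit.signature));
  return { pageviews: hits.length, visits: visitors.size };
}

// One visitor refreshing the same page 500 times.
const hits = Array.from({ length: 500 }, () => ({ signature: "abc123", path: "/" }));
console.log(summarize(hits)); // { pageviews: 500, visits: 1 }
```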

At first glance this appears to be a well thought out solution. It will never be able to give you some of the stuff that GA can give you but that is by design. The problem that I see is that as long as GA is able to claim they are GDPR compliant there will be very few websites that will see this as a necessity and so adoption will be relatively low. But, and this is just an idea, one of the things that a company could do is to proudly present a 'zero retention' button or logo assuming they do not have other trackers on their pages. That way it might become a distinguishing factor for the adopters and that might drive further adoption.

Thanks for building this, I will promote it.

A fantastic idea. We've recently designed a button for websites to show users that they care but I really love this idea of a 'zero retention' button :)

Looks decent, but pricing is insanely high for the extremely limited set of stats.

Totally fair, as that's an opinion :) Luckily our customers are happy with the price, and other folks use the open-source (100% free) version. Cheers!

I actually think your prices are pretty reasonable for business use, but I wonder if you've considered a "personal" plan of, say, $6/month for up to 10k hits, non-commercial use only.

I've been looking for a privacy-first tracker for some personal sites for a while, but nobody is offering pricing that makes sense at the lower end.

The price is equivalent to one hour of developer time per year.

Pricing looks comically cheap for SMB and above.

exactly, expected to be 5-10 times higher.

I think the GDPR was enacted into law not to prevent cookies, but to prevent collecting data on regular people. This seems to circumvent the technicalities of the law but not the spirit. The risk is that they enact a new law that puts even further restrictions on website operators.

I'm not sure this is a good idea.

Appreciate the note and thought here. I do disagree though, as it feels like the spirit of GDPR is to make into law the protection and privacy for regular people. Fathom does this to the best of our ability, and our code reflects our agreement with the spirit of the law.

Analytics is required for business and isn't going anywhere. The laws don't feel like they are trying to shut down analytics completely, they are just asking this type of software to do better. That's what I think we are doing with this—and there are no other analytics companies who come close to our level of obfuscation and non-tracking of personal data.

If the intent of the law is do better with privacy and data, we are doing it to the best of our abilities. It's not a skirting around the issue, we are agreeing with it in our code and logic for how our tracker works.

Thanks for the concern here. We are GDPR compliant (and may be exempt from it). See here: https://usefathom.com/data/

Tell me if I get this correct:

Alice visits a site and gets the hash 1234. The analytics data is stored and associated with hash 1234, but soon after, hash 1234 is removed. However, the aggregate visitor analytics that were associated with hash 1234 persist. Then another user (say Alice again) returns and gets hash 5678. Analytics data is tracked, stored with hash 5678 for the 30 minutes (or less), and then hash 5678 is again removed. However, the analytics data that was associated with 5678 is aggregated with the rest?

That's exactly how it works. The purpose being to make it completely impossible to ever single out a user and see which pages they viewed on a website.

You might like to edit the line on that policy page that refers to "the most privacy-focused manor"... while a privacy-focused manor is an interesting idea, I suspect you meant "manner". :)

Equally, I'm not sure what it means to be GDPR "complaint" but I'm thinking it's probably supposed to be "compliant" ;)


It's so exciting that thanks to GDPR we are now seeing innovative analytics solutions that respect privacy.


I am not sure what exactly they did here. How do they persist the hash between requests?

My guess is they use localStorage and send the hash to their servers with each request.

So we are talking about a mechanism that’s just like a cookie.

As long as they don’t have any PII and can’t figure out who the user was, then I think the GDPR gives them an exception.

But “without cookies” claim is dubious!

It's a session long affair, rather than a persistent affair, with 30 minutes apparently the arbitrary cut-off for a session.

We have been GDPR compliant for many months but our aim here was to meet E-Privacy demands.

We don't use localStorage... Read the blog post, we don't use cookies.

Oh, apologies, you are right! You don't use any kind of cookie mechanism. I just read it again:

  Random SHA256 String (daily regenerated)
  IP Address
  User Agent
  Site ID
  Day of the year

My only question is about this "Random SHA256 String", where is it stored between requests?

Redis Cache. It's a Fathom-wide random string that is used to prevent rainbow table attacks. The salt is refreshed at midnight every day.
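A minimal sketch of how a hash could be built from the five inputs listed above. The field order, the `|` separator, the UTF-8 encoding, and both function names are assumptions made for illustration; the thread does not say how Fathom actually combines the values, and the salt here lives in a plain variable rather than Redis.

```python
import hashlib
import secrets
from datetime import date


def new_daily_salt() -> str:
    """Random 64-character hex string, standing in for the daily salt."""
    return secrets.token_hex(32)


def visitor_hash(salt: str, ip: str, user_agent: str, site_id: str, day: date) -> str:
    """SHA-256 over salt, IP, user agent, site ID and day of the year.

    The join order and separator are hypothetical.
    """
    day_of_year = day.timetuple().tm_yday
    payload = "|".join([salt, ip, user_agent, site_id, str(day_of_year)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


salt = new_daily_salt()
today = date.today()
first = visitor_hash(salt, "203.0.113.7", "ExampleBrowser/1.0", "site-1", today)
second = visitor_hash(salt, "203.0.113.7", "ExampleBrowser/1.0", "site-1", today)
print(first == second)  # True: identical inputs give the identical hash
```

Because the salt is one of the inputs, regenerating it changes every hash, so values from different days cannot be matched against each other.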

Thanks, makes total sense.

So basically the only drawback I see is that all employees from behind one corporate NAT using same browser will count as one user.

But don’t see a way around that if you can only use IP and User Agent strings.

why are you calling it an analytics platform when it isn't one?

This is still logging everything the GDPR says you can't without asking for consent, but you made your search convoluted (but not less efficient if you have all the pieces) to (suggest|lie?) that you need to break the hash and that's why you don't need consent.

All of the information you are using in the hash would be in the search query itself: IP, user agent, path, date, etc. So there is no need to reverse the hash. You just hash your search query and compare in O(1) time.

The only piece of information that realistically makes the hash slightly difficult to get is the random number refreshed every day. But either you store it (and I have no reason to believe you do not) or it makes the brute force effort trivial as I only need to generate the hash with that variable now.

You're focused mostly on Recital 26, which was only a theory of mine, outside of that we are GDPR compliant anyway. I likely shouldn't have included it since that isn't our primary ground for processing. Please see: https://usefathom.com/data/

And yes the daily hash gets stored until midnight. But what are you talking about with 'search query' containing IP, user agent etc.?

If a search query on your data would contain all the components of the original hash, i don't have to walk backwards and break the hash. i just have to hash my query terms in the same way.
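The argument can be sketched as follows. This is an illustration of the commenter's reasoning, not of Fathom's data layout: it assumes whoever runs it holds both the stored hashes and the salt in use, and `fingerprint` uses a hypothetical field order and separator.

```python
import hashlib


def fingerprint(salt: str, ip: str, user_agent: str, site_id: str, day_of_year: int) -> str:
    """Hypothetical hash construction: SHA-256 over the joined fields."""
    payload = "|".join([salt, ip, user_agent, site_id, str(day_of_year)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def was_seen(stored_hashes: set, salt: str, ip: str, user_agent: str,
             site_id: str, day_of_year: int) -> bool:
    """Hash the query terms the same way and look the result up.

    Nothing is reversed; set membership is a constant-time lookup on average.
    """
    return fingerprint(salt, ip, user_agent, site_id, day_of_year) in stored_hashes


salt = "example-salt"  # assumed to be known to whoever runs the query
stored = {fingerprint(salt, "203.0.113.7", "ExampleBrowser/1.0", "site-1", 182)}

print(was_seen(stored, salt, "203.0.113.7", "ExampleBrowser/1.0", "site-1", 182))  # True
print(was_seen(stored, salt, "203.0.113.8", "ExampleBrowser/1.0", "site-1", 182))  # False
```

Without the salt the lookup fails for every candidate, which is why the dispute in this thread turns on who can obtain the salt and for how long.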

Also, I suggested you store the daily hash forever. But even if you really erase it every day, as you say, if you or an attacker makes the same request every day at a predetermined time, then when you/they get your logs, you/they can use that predictable request to get the daily secret too.

I consider the information to be stored in plain text, and that you would have to have requested permission just the same. You pretty much have an identifiable user (via IP/UA/access time) stored in your logs.

Anonymization is removal of information, not encoding it in a convoluted hash.

So that needs to be our next target point (access logs). We want to move to a position where we keep no access logs.

And a hacker could indeed "win" if they broke into our system, got the salt and exported the DB. We didn't focus on this in our article, as it's unbelievably unrealistic, but it's still possible. Our next step is to address that.

Without the hash, it's practically impossible to brute force.

Not talking about a hacker. I am stating that the described hash dance offers no exclusion from GDPR as saying "we promise we won't look" would do.

My point about brute forcing being useless is that you hold all the information needed to re-create the hash. All but one tiny piece, which is the random number. So brute force is a very effective O(<tiny piece size>). And since it is stored in your locally available data, there are no rate constraints.

> I am stating that the described hash dance offers no exclusion from GDPR as saying "we promise we won't look" would do.

Under your logic, you would never trust us because we could just add $log->write(UserIp, UserAgent, Hostname, Path) in plain text. Trust is very important and what you do with the data is important under GDPR.

And we don't hold all the information to re-create the hash, that's the thing.

I thought a lot about "Oh but you could just do this, this and this" but, no, that argument doesn't hold. Our obligation under GDPR is what we actually do with data.

This is very weak reasoning, because you cannot identify an individual by IP either. This project looks like trying to exploit loopholes. The idea behind GDPR is to make sure companies log only data they need. This project looks into logging the data but without expressing why this is even necessary. Therefore I don't think this is compliant with GDPR.

> because you cannot identify an individual by IP either

Yes you can, particularly if you correlate across different websites.

You are conflating identification of a person by behaviour analysis with matching an ID. What is the ID is irrelevant here - may as well be a hash. That just proves my point that this project is not compliant.

I remember when I learned that IP was considered personal information, I was shocked. But I thought about it and it does make sense.

GDPR is for protection of personal data and we store no personal data. Please take a read of this: https://usefathom.com/data/

I don't believe you understand what personal data and GDPR are. You are capturing user behaviour and that is very personal regardless of whether it is "anonymised" or not - and that is without a clear need for doing so. That is pretty much against GDPR.

You come across as somewhat hostile but I'm going to assume good intent on your part, so thank you for the challenges on our stance.

So if you take a look at Recital 26 (https://gdpr-info.eu/recitals/no-26/):

> To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly.

> To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments.

> The principles of data protection should therefore not apply to anonymous information, namely information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.

> This Regulation does not therefore concern the processing of such anonymous information, including for statistical or research purposes.

So the piece about the principles of data protection not applying to personal data rendered anonymous is crucial. We believe that GDPR does not apply to us because of that. But even if GDPR did apply to us (we'll assume it does, that's always the best way to be), then our legal basis is that there's legitimate interest. As a website owner, it is in your legitimate business interest to understand how your website is performing - e.g. the most popular pages, the pages where people linger for longer, the pages where people bounce.

Article 4 (1) states:

‘personal data’ means any information relating to an identified or identifiable natural person (‘data subject’); an identifiable natural person is one who can be identified, directly or indirectly, in particular by reference to an identifier such as a name, an identification number, location data, an online identifier or to one or more factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person;

- a hash number falls into this. You cannot just quote recital 26 and stop reading since you found it fitting. Recital 30 covers the case for "other identifiers" that might replace cookies. No hard feelings we all do.

The data might be anonymous for a third party, but if you can single out just one person, or in other words one unique visitor, it is not anonymous. NB. One IP poisons the whole data.

So your fallback is Article 6(1)(f), which is reasonable, but you cannot assume the interest of a site owner is always higher than the interest of the visitors. You have to put your arguments into writing and have the means for people to appeal it. 6(1)(f) is not meant as a blank cheque or batch job...

> Online identifiers for profiling and identification

> Natural persons may be associated with online identifiers provided by their devices, applications, tools and protocols, such as internet protocol addresses, cookie identifiers or other identifiers such as radio frequency identification tags.

> This may leave traces which, in particular when combined with unique identifiers and other information received by the servers, may be used to create profiles of the natural persons and identify them.

The thing is, you can't create profiles. So right now I could give you a single entry for our website

> NULL "" "https://www.usefathom.com" "bb9377f4cf33093765835a48e962a5dbd3168499abd12b120c8c118c86c41479"

How could we possibly use that to profile / identify? The hash (bb9377f4cf33093765835a48e962a5dbd3168499abd12b120c8c118c86c41479) is unique in the database table and never repeats.

I hear you. We don't rely on Recital 26 to comply with GDPR. I've not had the Recital 26 piece confirmed by a lawyer but it's a personal hunch / exploration. Your comments on Recital 30 are helpful, thank you. I'd like to hear your thoughts on my reply if possible :)

If you don't need the identifier, why not leave it out altogether (Art. 5(1)(c))? Or is it unique just in this case?

Identifier is used for previous view only. Previous view is updated when a new view is inserted into the temp table & previous view's user identifier is wiped.
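A minimal in-memory sketch of that flow, assuming the temp table can be modelled as a list of rows. The thread does not say which fields are updated on the previous view, so `is_bounce` is a hypothetical example, and `PageView` and `record_view` are illustrative names rather than Fathom's.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PageView:
    path: str
    visitor_hash: Optional[str]  # set to None once the identifier is wiped
    is_bounce: bool = True       # hypothetical field updated on the next view


def record_view(temp_table: List[PageView], visitor_hash: str, path: str) -> None:
    """Insert a new view; update and de-identify the visitor's previous one."""
    for view in temp_table:
        if view.visitor_hash == visitor_hash:
            view.is_bounce = False    # the visitor moved on to another page
            view.visitor_hash = None  # wipe the identifier on the previous view
    temp_table.append(PageView(path=path, visitor_hash=visitor_hash))


table: List[PageView] = []
record_view(table, "5678", "/")
record_view(table, "5678", "/pricing")
print([(v.path, v.visitor_hash) for v in table])
# [('/', None), ('/pricing', '5678')]
```

At any moment only a visitor's most recent view still carries the hash, so earlier rows cannot be joined back into a browsing path.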

They don't have any PII and are therefore not subject to GDPR. They have data that, if it were not anonymized, would be PII, but it's anonymized and therefore isn't.

It doesn't matter if it is an IP or another identifier e.g. a hash. Person can be identified by behaviour and this is not anonymised.

How can a person be identified from a hash by behaviour? We built the software but you seem to know something we don't...
