
Ask HN: Privacy-centric and ethical analytics solutions? - jamieweb
I would like to gain some high-level insight into the traffic accessing my website. For example:

  - Unique visitor counts
  - Most viewed pages
  - Referring sites
  - Activity per time of day/week/month

I do not want to be able to track individual users - I want to keep this strictly to statistics rather than intrusive tracking. That rules out pretty much anything that involves JavaScript or client-side code.

I've been trying to put together a solution using the AWStats log analyser, however this requires me to collect IP addresses. If I remove or obfuscate IP addresses, then the 'Unique Visitors' count doesn't work. Unfortunately it seems that AWStats uses IPs as the primary method for identifying unique visitors.

What other solutions are out there? My site is PHP so doing something myself would also be acceptable.
======
harianus
I have built a platform that does exactly that. It does require JavaScript for
a few reasons:

1. It lets you analyze single-page apps.

2. Caching of pages has no effect on the JS being executed. Most back-end
tracking can't tell whether a page was visited when it's served from cache.

So I would recommend using JavaScript if the above reasons apply to you. As
far as I know, you can't really obfuscate the IP address in a way that
prevents tracking a visitor. That's why I decided to drop IP addresses from
our logs and not use them at all.

Regarding your last point: unique visitors are hard to measure if you don't
use an IP or a cookie. A cookie is tracking, but I'm not sure how intrusive
that is for you. It could be a cookie with just a value of 'visited=1' or
something, so you know it's a returning visitor when the cookie is present.
That way you don't really track, I think.
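As a rough sketch of that 'visited=1' idea (Python, with made-up function names for illustration): a request without the cookie counts as a new unique visitor and gets the marker cookie set; later requests carry it and count only as page views. The cookie holds no identifier at all.

```python
from http.cookies import SimpleCookie

def is_returning(cookie_header: str) -> bool:
    """True if the request carries the visited=1 marker cookie."""
    jar = SimpleCookie()
    jar.load(cookie_header or "")
    return jar.get("visited") is not None

def set_cookie_header() -> str:
    """Value for a Set-Cookie header marking this browser as seen,
    with no unique ID and a one-year lifetime."""
    return "visited=1; Max-Age=31536000; Path=/; SameSite=Lax"
```

Since every visitor gets the identical cookie value, knowing a browser has it tells you nothing about who it is.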

You can see demo stats of my platform here
[https://simpleanalytics.io/simpleanalytics.io](https://simpleanalytics.io/simpleanalytics.io)

~~~
jamieweb
It was actually your Show HN from a few months ago that prompted me to add
'analytics' to my to-do list.

For the past few years of running my site, I've had basically no analytics. I
have the site in Google Webmaster Tools, which shows stats on clicks from
Google, but other than that there isn't much. At the moment, traffic from
everywhere but Google is completely unaccounted for.

I also disabled web server logging completely in May when GDPR came in.

Simple Analytics looks really good, and it's on my personal list of 'cool
startups to use in the future', however for my particular site, there are a
couple of issues:

- Adblock users

- JavaScript

The reason I'm so interested in log analysers is that the data is closer to
network/traffic statistics than it is analytics/tracking. I think that a
massive portion of the traffic to my site will be with Adblock, so if all of
it is unaccounted for then the numbers will be way off.

Also, my site is strictly JavaScript free. It does look like you offer a
<noscript> version though. My site is locked down with a super-tight Content
Security Policy, and including an external JS file would be too big of a risk
in my opinion.

As far as I can see, Simple Analytics isn't able to count unique visitors.
This is what makes it so good when it comes to privacy and security, but it's
a stat that I'd really like to see. I've started putting together my own
solution using a bloom filter for this, as nikonyrh suggested in this thread.

So overall I think that Simple Analytics is great and I would love for more
sites to adopt it, rather than going down the guy with a camera and notebook
route (as shown in your promo video!). However for my particular project, it
isn't suitable at the current time. As I said though, it's on my list and I
can definitely see it being useful for other projects that I may be involved
in.

Thank you :)

~~~
harianus
Thank you for your kind words! It sounds like you'd enjoy building something
cool for this website.

To reply on your points:

- Adblock users are definitely a portion of traffic, and some blockers block
major trackers - unfortunately Simple Analytics too. But I'm building a proxy
so that people can point a CNAME of their domain at Simple Analytics.

- Simple Analytics does indeed have a noscript version. It loses the ability
to get the referrer, though.

- Uniques without tracking: yes, it's possible. Will keep HN posted.

Good luck with the bloom filter!

------
nikonyrh
You could hash the IPs before storing them, but as there aren't that many
IPv4 addresses, it would be trivial to reverse.

However, if you use bloom filters to calculate "distinct counts", then I think
you cannot reliably reconstruct visitors' IPs. You'd have to do some planning
in advance on the implementation details so that you can extract the stats
you're looking for.
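The idea can be sketched in a few lines (a rough Python sketch, not production code): each IP is hashed to k bit positions, and the filter records only which positions are set, so it can answer "probably seen before" without storing the IP itself.

```python
import hashlib

class BloomCounter:
    """Counts (approximately) distinct items without storing them."""

    def __init__(self, m_bits: int = 2**20, k: int = 3):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)
        self.uniques = 0

    def _positions(self, item: str):
        # Derive k bit positions from slices of one SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, ip: str) -> bool:
        """Record an IP; return True if it looks like a new visitor."""
        new = False
        for pos in self._positions(ip):
            byte, bit = divmod(pos, 8)
            if not self.bits[byte] & (1 << bit):
                new = True
                self.bits[byte] |= 1 << bit
        if new:
            self.uniques += 1
        return new
```

A repeat visitor maps to bits that are already set, so it isn't counted again; a new visitor (almost always) sets at least one fresh bit.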

~~~
jamieweb
Thanks for the suggestion, this has got me thinking...

I had thought about hashing IPs and user agents combined, but even that would
be quite reversible since user agents aren't that unique really.
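To illustrate how reversible a plain hash is: the public IPv4 space is small enough to enumerate exhaustively. A toy Python sketch, scanning only one /16 for speed (the full space is ~3.7 billion hashes, minutes on commodity hardware):

```python
import hashlib

def hash_ip(ip: str) -> str:
    return hashlib.sha256(ip.encode()).hexdigest()

# Pretend this is the "anonymised" value found in a log.
stored = hash_ip("203.0.113.7")

# An attacker simply hashes every candidate IP until one matches.
recovered = None
for b3 in range(256):
    for b4 in range(256):
        candidate = f"203.0.{b3}.{b4}"
        if hash_ip(candidate) == stored:
            recovered = candidate
            break
    if recovered:
        break

print(recovered)  # prints 203.0.113.7
```

Salting doesn't help much here either, since the salt has to be stored somewhere to keep the mapping consistent across requests.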

I've done some research on bloom filters and this looks like a good lead. It
could be challenging to implement with AWStats though, as I guess I'd have to
perform the bloom filter logic before writing the log, and then write a fake
IP address accordingly in order to keep the Unique Visitors count consistent.

For example, if the bloom filter says that the IP is not known, then I could
write 10.0.0.2 to the log and increment a counter. The next time an unknown
IP comes in, 10.0.0.3 is written, then 10.0.0.4, and so on.

If the IP is already known, then the unique visitor has already been counted,
so just write 10.0.0.1 to the log.

The result of this is that all log entries would be for 10.0.0.1, except for
those where a new visitor had connected for the first time, which will be an
arbitrary IP from the 10.0.0.0/8 range.
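That scheme might look something like this in Python (a rough sketch with made-up names; a plain set stands in for the real bloom-filter membership test, and the counter would need to persist across requests):

```python
import hashlib

SENTINEL = "10.0.0.1"  # logged for every already-seen visitor

class FakeIpLogger:
    def __init__(self):
        self.seen = set()   # stand-in for the bloom filter
        self.next_host = 2  # 10.0.0.2 is the first "new visitor" IP

    def log_ip(self, real_ip: str) -> str:
        """Map a real IP to the fake IP that gets written to the log."""
        key = hashlib.sha256(real_ip.encode()).hexdigest()[:5]
        if key in self.seen:
            return SENTINEL  # repeat visitor: no new unique
        self.seen.add(key)
        host = self.next_host
        self.next_host += 1
        # Spread the counter over the 10.0.0.0/8 range.
        return f"10.{(host >> 16) & 255}.{(host >> 8) & 255}.{host & 255}"
```

AWStats would then count one "unique visitor" per distinct fake IP, while 10.0.0.1 absorbs all repeat traffic.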

Have I got the right idea here, or is there a better way? This is my first
rough concept.

~~~
nikonyrh
Ahaa, I didn't take AWStats into account. I was thinking more in terms of "one
bloom filter per thing you want a distinct count of", for example one per day
& IP & user agent. I guess your approach would work fine for quite "empty"
bloom filters, but if your filter is too large it again becomes more plausible
to reverse the IPs from it.

You get more accurate counts from the
[Approximating the number of items in a Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter#Approximating_the_number_of_items_in_a_Bloom_filter)
formula. And I'm not even sure what to do if you have under-sized your filter
and suddenly get lots of unique visitors - I guess you'd need to create a
larger filter on the fly.
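That formula estimates n ≈ -(m/k) · ln(1 - X/m), where m is the filter size in bits, k the number of hash functions, and X the number of set bits. A one-liner in Python:

```python
import math

def estimate_count(m: int, k: int, set_bits: int) -> float:
    """Approximate number of distinct items inserted into a bloom
    filter of m bits with k hash functions, given set_bits bits set."""
    return -(m / k) * math.log(1 - set_bits / m)
```

It corrects for hash collisions, so it stays usable even as the filter fills up - unlike naively counting "adds that set a new bit".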

Btw Redis supports bloom filters, so you don't have to worry about the
in-memory implementation ;)
[https://redislabs.com/blog/rebloom-bloom-filter-datatype-redis/](https://redislabs.com/blog/rebloom-bloom-filter-datatype-redis/)

~~~
jamieweb
I'm looking to also rely on k-anonymity a bit for this.

If I have a bloom filter that is 128 KiB (1,048,576 bits) in size and I use
the first 5 hex characters of a SHA-256 hash as the identifier, then there are
only 16^5 = 1,048,576 possible unique values that can be stored.

The total number of publicly routable IPv4 addresses is 3,706,452,992 - so
for each bit in the bloom filter, there are on average ~3,535 possible IPs
that could map to it.

In other words, if you were to brute force the bloom filter with the hashes of
every publicly routable IPv4 address (which wouldn't take very long since it's
only 3.7 billion), the average accuracy would be 1 in ~3500.
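For reference, the arithmetic behind that 1-in-~3,500 figure:

```python
PUBLIC_IPV4 = 3_706_452_992  # publicly routable IPv4 addresses (figure from above)
FILTER_BITS = 16 ** 5        # values expressible in 5 hex chars = 1,048,576

ips_per_bit = PUBLIC_IPV4 / FILTER_BITS
print(round(ips_per_bit))    # prints 3535
```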

This means that IPs sharing the same first 5 SHA-256 characters wouldn't be
counted separately, but that affects only about 1 in ~3,500, so it's not a big
problem. At worst it would result in a slightly lowered unique visitors count
- which is much better than an inflated one.

For IPv6, the same applies but even better, as there are orders of magnitude
more v6 addresses than v4.

> But if your filter is too large it becomes again more plausible to reverse
> the IPs from it

Yes, I've been considering this. If I were to use a larger filter, I'd need to
take more characters of the SHA-256 hash in order to match the total size of
the filter. The first 10 hex characters give 16^10 (~1.1 trillion) possible
values - easily enough to uniquely identify each IPv4 address - so with 10
characters the bloom filter would be trivially brute-forceable. With only 5
characters though, it's a 1 in ~3,500 accuracy rate as I mentioned above.

> I guess your approach would work fine for quite empty bloom filters

By this did you mean 'small' filters (i.e. by file size), or filters that just
aren't heavily populated? If the latter, it should still be fine as long as
the size of the filter allows for a good k-anonymity ratio (1 in ~3,500, etc.).

