Show HN: Server-Side Tracking Without Cookies in Go (marvinblum.de)
19 points by marvinblum on June 28, 2020 | 22 comments



I once built a cookie-less tracking app that used neither the IP address nor the user agent. The idea came from this[0]. It would serve a file to the user with an ETag, which let the browser cache the file for an indefinite period of time (effectively acting as a cookie without being classified as one). There might be some quirks, as the caching behaviour is not guaranteed (I think caching is at the browser's discretion and also depends on the available cache space). But it is an interesting approach to use something that is similar to a cookie but technically isn’t.

Also, user agents on their own are not unique[1], but combined with the IP address it should be fairly safe to assume that the fingerprint generated from both is.

[0] https://lucb1e.com/rp/cookielesscookies/ [1] https://www.eff.org/deeplinks/2010/01/tracking-by-user-agent
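
To make the ETag idea concrete, here is a minimal Go sketch of how such an endpoint could look. It is not the code from the app described above or from the linked article; the handler name, the helper functions, and the random-ID scheme are all assumptions made for illustration, and it only handles a single ETag in If-None-Match.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// parseETag extracts the opaque value from an If-None-Match header,
// dropping an optional weak prefix (W/) and the surrounding quotes.
// Simplification: a list of several ETags or "*" is not handled.
func parseETag(header string) string {
	v := strings.TrimSpace(header)
	v = strings.TrimPrefix(v, "W/")
	return strings.Trim(v, `"`)
}

// newVisitorID returns a random 16-byte identifier, hex encoded.
func newVisitorID() (string, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// trackingHandler (hypothetical) serves a tiny resource whose ETag
// doubles as a visitor ID.
func trackingHandler(w http.ResponseWriter, r *http.Request) {
	id := parseETag(r.Header.Get("If-None-Match"))
	if id != "" {
		// Returning visitor: the browser revalidated its cached copy,
		// so the ID it sent could be recorded here.
		w.Header().Set("ETag", `"`+id+`"`)
		w.WriteHeader(http.StatusNotModified)
		return
	}
	id, err := newVisitorID()
	if err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	// First visit: hand out a fresh ID and ask the browser to
	// revalidate on every use instead of serving silently from cache.
	w.Header().Set("ETag", `"`+id+`"`)
	w.Header().Set("Cache-Control", "private, no-cache")
	fmt.Fprint(w, "ok")
}

func main() {
	// Simulate a first request and a revalidation with the issued ETag.
	first := httptest.NewRecorder()
	trackingHandler(first, httptest.NewRequest(http.MethodGet, "/pixel", nil))
	etag := first.Header().Get("ETag")

	req := httptest.NewRequest(http.MethodGet, "/pixel", nil)
	req.Header.Set("If-None-Match", etag)
	second := httptest.NewRecorder()
	trackingHandler(second, req)
	fmt.Println(second.Code, second.Header().Get("ETag") == etag)
}
```

As noted above, none of this is guaranteed to work: the browser may evict the cached entry or never send If-None-Match at all.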


I first saw the ETag method in the evercookie project. HTML5 Canvas is also a funny/smart technique that I saw there. https://github.com/samyk/evercookie

One of the purposes of the evercookie project was to have a collection of tracking methods that browser vendors can test against their implementation.


Ah yes, I've read about that method too, but I think it's safer to use the User-Agent and IP address, as that's probably more reliable. The browser might ignore the ETag.


The cookie law covers evercookies like this, so it's not a way to get around that or the GDPR, just around people who turn cookies off.


didn't know that - thanks


> Here are the parameters used by Pirsch to generate a fingerprint:

> * the IP is the most obvious choice. It might change, as IPs are reused by ISPs as they only have a limited pool available to them, but that shouldn't happen too frequently

> * the User-Agent HTTP header contains information about the browser and device used by the visitor. It might not be filled though, but it usually is

A couple of questions:

1. Does anyone know of a decent Firefox plugin that can subtly change the User-Agent string for each new tab? I'm not thinking of changing it to a different browser rendering engine string, but maybe adding a random key like

    hash-buster/$randomstring
2. Isn't Chrome making its User-Agent string static? (ref: https://www.osnews.com/story/131177/google-is-seeking-to-dep...). @marvinblum (looks like you're the maintainer?) wouldn't this then skew your results?


Oh well, that sounds bad for my solution. I'm unsure how well it would work with just the IP, but it's possible to add more headers, like the accepted languages and file types. If you really need precise tracking, you probably won't get around tracking on the client side.
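
As a rough illustration of a fingerprint built from the IP, the User-Agent, and one extra header, here is a short Go sketch. It is not Pirsch's actual implementation; the function name, the choice of SHA-256, the separator byte, and the use of Accept-Language as the extra input are assumptions for the example.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// fingerprint (hypothetical) hashes request attributes together with a salt.
// A zero byte is written after each part so that ("ab", "c") and
// ("a", "bc") do not produce the same input stream.
func fingerprint(ip, userAgent, acceptLanguage, salt string) string {
	h := sha256.New()
	for _, part := range []string{ip, userAgent, acceptLanguage, salt} {
		h.Write([]byte(part))
		h.Write([]byte{0})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// Example values only; 203.0.113.7 is a documentation address.
	fp := fingerprint("203.0.113.7", "Mozilla/5.0", "en-US,en;q=0.9", "example-salt")
	fmt.Println(fp)
}
```

Each additional header only helps to tell visitors apart if it actually differs between them, so two people with the same browser and language settings behind one IP would still collapse into one fingerprint.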


I'm not sure if I missed some other explanation around IP addresses, but aside from rotating IPs from ISPs this would have trouble in shared networks where people use the same browser and/or OS so the IP and agent string would be the same. Say, an office or home network. Any way around this using other data perhaps? Not dissing, just curious as to the server analytics space atm. With Chrome taking most browser traffic and Windows for desktop OS (Android on mobile, globally) there may be frequent false positives as such. Best next assumption as a "solution" is that one wouldn't expect all network users to visit the same site :p


You can compensate for it a bit by reading the X-Forwarded-For header. If someone sits behind a proxy, the proxy usually sets this header to the actual IP address of the client. You can see this here: https://github.com/emvi/pirsch/blob/master/fingerprint.go#L2...
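
For readers who don't want to open the link, here is a minimal Go sketch of the general technique. It is not the code from the linked file; the function names are made up for this example, and it assumes the first entry of X-Forwarded-For is the original client.

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"strings"
)

// clientIPFrom (hypothetical) prefers the first address listed in
// X-Forwarded-For and falls back to the connection's remote address.
func clientIPFrom(remoteAddr, forwardedFor string) string {
	if forwardedFor != "" {
		// The header may hold a comma-separated chain: client, proxy1, proxy2.
		first := strings.TrimSpace(strings.Split(forwardedFor, ",")[0])
		if net.ParseIP(first) != nil {
			return first
		}
	}
	// RemoteAddr is usually "host:port"; strip the port if present.
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		return remoteAddr
	}
	return host
}

// clientIP applies clientIPFrom to an incoming request.
func clientIP(r *http.Request) string {
	return clientIPFrom(r.RemoteAddr, r.Header.Get("X-Forwarded-For"))
}

func main() {
	r := &http.Request{
		RemoteAddr: "10.0.0.1:54321",
		Header:     http.Header{"X-Forwarded-For": {"203.0.113.7, 10.0.0.1"}},
	}
	fmt.Println(clientIP(r))
}
```

Keep in mind that the header is supplied by the client side and can be spoofed, so it is only trustworthy when it comes from a proxy you control.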


Behind a proxy that is (optionally) configured to share this - yes. But what about a NAT[1] gateway? Most home (and many business) networks NAT your connection - to the destination server, all devices in that network have the same IP (or the IP is seemingly very dynamic if measuring hash(agent+ip)). Other than a server or a mobile device on network data (assuming they aren't using IPv6), an IP will almost always represent a cluster of devices.

[1]: https://en.wikipedia.org/wiki/Network_address_translation


Hard to tell how often that happens. The User-Agent should help a lot with that. I will collect more data and try to figure out how well it really works.


That was a good read. And your dark mode felt like a massage for my eyes


Thank you! I use concrete [1] for styling, which takes care of it. My favorite micro CSS framework :)

[1] https://concrete.style/


Hey Hackers! I built a server-side tracking lib in Go, please let me know what you think.


I'm already seeing a lot of bot traffic via HN and started a list here: https://github.com/emvi/pirsch/issues/1 Feel free to add more!


Perhaps you want to look at the isbot npm package. They already have a good list.

https://github.com/gorangajic/isbot/blob/master/list.json
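
A list like that can be applied with a plain substring check on the User-Agent. The Go sketch below shows the idea; the three markers are placeholders chosen for the example and are not taken from the isbot list or from the Pirsch issue.

```go
package main

import (
	"fmt"
	"strings"
)

// botMarkers is a tiny, illustrative list; a real one would be much longer.
var botMarkers = []string{"bot", "crawler", "spider"}

// isBot (hypothetical) reports whether the User-Agent contains any marker,
// ignoring case.
func isBot(userAgent string) bool {
	ua := strings.ToLower(userAgent)
	for _, marker := range botMarkers {
		if strings.Contains(ua, marker) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isBot("Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"))
	fmt.Println(isBot("Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"))
}
```

This only catches bots that identify themselves; anything sending a regular browser User-Agent passes through.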


Thanks!


Note: if your fingerprint can be used to identify a specific person in combination with other data (it can be), you're still subject to GDPR on this (assuming you're within jurisdiction). It's probably entirely reasonable that, given the salt that gets regenerated each midnight (and which is stored separately from the dataset), and the lack of further processing done on the dataset to match it with specific people, this is a legitimate interest, but you'll still have to mention it in your privacy policy.
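
For context, a salt that is regenerated each day and kept only in memory could be sketched in Go roughly like this. This is an assumption about the general shape, not Pirsch's code; the type, its methods, and the use of the UTC date as the rotation key are invented for the example.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"sync"
	"time"
)

// dailySalt (hypothetical) keeps one random salt per calendar day in memory.
type dailySalt struct {
	mu   sync.Mutex
	day  string
	salt string
}

// forDay returns the salt for the given day (formatted as YYYY-MM-DD),
// generating a new one and discarding the old one when the day changes.
func (d *dailySalt) forDay(day string) (string, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	if day != d.day {
		buf := make([]byte, 32)
		if _, err := rand.Read(buf); err != nil {
			return "", err
		}
		d.day, d.salt = day, hex.EncodeToString(buf)
	}
	return d.salt, nil
}

// current returns the salt for today's UTC date.
func (d *dailySalt) current() (string, error) {
	return d.forDay(time.Now().UTC().Format("2006-01-02"))
}

func main() {
	var s dailySalt
	salt, err := s.current()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(salt))
}
```

Once the previous day's salt is overwritten, fingerprints from that day can no longer be regenerated from an IP and User-Agent.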


Can you explain how a person can be identified by the fingerprints I'm generating? I know it's possible to track the page flow of individual visitors, but as Pirsch does not store the IP or any information that is unique to someone, I don't know how I could trace that back to a person.


If you have the IP and user agent in some other system (for example, if the user hits your website again, or if they make an access request where you could verify this data), you could take the salt out of Pirsch and combine the three to regenerate your fingerprint, thus tying the hit data to the user. The fact that the salt exists entirely in-memory within a defined system (unless it were, for example, within an HSM which would be extremely difficult to get the salt out from) would seem to be irrelevant to the definition of personal data within the GDPR - although I'd be interested to see discussion of that. The data stops being personal data once you drop the salt entirely.

I'm also fairly certain that a trace of a single user from A through B (e.g. they hit the "payment confirmed" page at a specific time, having hit a specific product page before) could be correlated with a user in a purchases/shipping database and from there to a name and address, depending on the overall system Pirsch gets used in. Bringing separate datasets into the same tooling to query them isn't particularly hard. If it's used on a simple blog with no other tracking or systems that store this data, this argument probably doesn't hold water.

Pseudonymising data at the earliest possible point fulfils another part of the GDPR (to do with controls and protection of personal data), but doesn't in itself make something not personal data, according to the UK's ICO.

Recital 26 of the GDPR states - "To determine whether a natural person is identifiable, account should be taken of all the means reasonably likely to be used, such as singling out, either by the controller or by another person to identify the natural person directly or indirectly. To ascertain whether means are reasonably likely to be used to identify the natural person, account should be taken of all objective factors, such as the costs of and the amount of time required for identification, taking into consideration the available technology at the time of the processing and technological developments."

But the fact of the matter is that there is so little risk here that it's likely to be a legitimate interest for the vast majority of businesses, thus not requiring consent (but still requiring adherence to the rest of the GDPR on personal data), again according to the ICO.


Thanks for the explanation. I guess I should add a secret to the generated hash. That would prevent anyone without access to it from regenerating the same hash.
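
One common way to mix a secret into a hash is a keyed hash such as HMAC. The Go sketch below shows what that might look like; it is an illustration of the idea rather than what Pirsch does, and the function name and the choice of HMAC-SHA256 are assumptions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// keyedFingerprint (hypothetical) computes an HMAC-SHA256 over the IP and
// User-Agent. Without the secret, the same inputs cannot be turned into
// the same fingerprint.
func keyedFingerprint(secret []byte, ip, userAgent string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(ip))
	mac.Write([]byte{0}) // separator between the two fields
	mac.Write([]byte(userAgent))
	return hex.EncodeToString(mac.Sum(nil))
}

func main() {
	// Example secret only; a real one would be random and kept private.
	secret := []byte("example-secret")
	fmt.Println(keyedFingerprint(secret, "203.0.113.7", "Mozilla/5.0"))
}
```

Anyone who does hold the secret, including the site operator, can still regenerate the fingerprint, so this narrows who can link the data rather than removing the link.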

As you said it's still a good idea to inform the visitors about the anonymous tracking. I'm pretty sure that most websites require a cookie note anyway, no matter how annoying it might be.


Most websites probably don't actually need a cookie notice, at least up-front - cookies that are "strictly required" for the website to function and are not used for non-required functionality are exempt, something that most people miss because the vast amount of discussion around this is around user-tracking advertising.



