
Ask HN: How to get aggregated user behaviour without tracking an individual user - harianus
For simpleanalytics.io I don't want to track individual users, but I would love to give businesses insight into combined user behaviour.

A business could ask: "How many visitors who arrived from DDG went on to sign up, and what was the average duration?" To calculate the conversion between landing and signup, you need to know the history of events.

Let's say we have a few events, including:

- Page view event with referrer DDG

- Signup event

The data could look like this:

  [
    ['/','mysite.com','ddg.com'],
    ['signup',30]
  ]

Explained:

  [
    [event, your website, referrer],
    [event, duration in seconds since first event]
  ]

When an event happens, I add it to a functional session cookie (expires after 30 min) and send the complete cookie to the API. The time of the first request is stored in a second cookie and is never sent to the API.

The two requests from the example above look like this:

  [['/','mysite.com','ddg.com']]

  [['/','mysite.com','ddg.com'],['signup',30]]

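As a rough illustration of the cookie logic described above, here is a minimal sketch in Python rather than browser JavaScript. The class name `SessionEvents` and the `record` method are hypothetical, timestamps are plain seconds, and the 30 minute expiry is not modelled. The point is only that the start time stays local while the event list is what gets sent.

```python
import json


class SessionEvents:
    """Stands in for the two cookies: the event list and the local start time."""

    def __init__(self):
        self.events = []        # contents of the session cookie (sent to the API)
        self.first_seen = None  # second cookie; never part of a payload

    def record(self, event, now, site=None, referrer=None):
        """Append an event and return the payload that would be sent."""
        if self.first_seen is None:
            # First event: remember the time locally, send no timestamp.
            self.first_seen = now
            self.events.append([event, site, referrer])
        else:
            # Later events carry only the seconds elapsed since the first one.
            self.events.append([event, int(now - self.first_seen)])
        return json.dumps(self.events)


session = SessionEvents()
first_request = session.record("/", now=1000.0, site="mysite.com", referrer="ddg.com")
second_request = session.record("signup", now=1030.0)
print(first_request)   # [["/", "mysite.com", "ddg.com"]]
print(second_request)  # [["/", "mysite.com", "ddg.com"], ["signup", 30]]
```
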
When the first request arrives, it is added to the database (see row 95):

  id | time     | event  | site        | referrer | link
  94 | 20:30:20 | /      | mysite.com  | ddg.com  | NULL
  95 | NOW()    | /      | mysite.com  | ddg.com  | NULL <---

The second request contains the information from the first request. When a request comes in with more than one array item, the server looks for the previous events in the database. Here it looks for a row where event=/, referrer=ddg.com, site=mysite.com, and time is less than 30 min ago, and finds row 94. After adding the new row, the table looks like this:

  id | time     | event  | site        | referrer | link
  94 | 20:30:20 | /      | mysite.com  | ddg.com  | a
  95 | 20:38:28 | /      | mysite.com  | ddg.com  | NULL
  96 | 20:30:50 | signup | mysite.com  |          | a    <---

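The matching step could be sketched roughly as follows, using an in-memory SQLite table. Everything here is an assumption made for illustration: the function names, times stored as integer seconds, a random hex string as the link value, and the choice to pick the oldest unlinked matching row. It only covers the two-event case from the example. A third event in the same session would need extra rules that the post does not spell out.

```python
import secrets
import sqlite3

SESSION_SECONDS = 30 * 60  # cookie lifetime from the post


def open_db():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE events (id INTEGER PRIMARY KEY, time INTEGER, "
        "event TEXT, site TEXT, referrer TEXT, link TEXT)"
    )
    return db


def handle_request(db, payload, now):
    """Store the newest event of a payload and link it to a plausible first row."""
    event, site, referrer = payload[0]
    if len(payload) == 1:
        db.execute(
            "INSERT INTO events (time, event, site, referrer) VALUES (?, ?, ?, ?)",
            (now, event, site, referrer),
        )
        return None
    name, duration = payload[-1]
    # Any unlinked row with the same first event from the last 30 minutes
    # will do; it does not have to come from the same visitor.
    match = db.execute(
        "SELECT id, time FROM events WHERE event = ? AND site = ? "
        "AND referrer = ? AND link IS NULL AND time >= ? ORDER BY time LIMIT 1",
        (event, site, referrer, now - SESSION_SECONDS),
    ).fetchone()
    link, when = None, now
    if match is not None:
        row_id, first_time = match
        link = secrets.token_hex(4)
        when = first_time + duration  # mirrors row 96 in the example table
        db.execute("UPDATE events SET link = ? WHERE id = ?", (link, row_id))
    db.execute(
        "INSERT INTO events (time, event, site, link) VALUES (?, ?, ?, ?)",
        (when, name, site, link),
    )
    return link


db = open_db()
db.execute("INSERT INTO events VALUES (94, 1000, '/', 'mysite.com', 'ddg.com', NULL)")
first = [["/", "mysite.com", "ddg.com"]]
handle_request(db, first, now=1488)  # becomes row 95
link = handle_request(db, first + [["signup", 30]], now=1518)  # becomes row 96
rows = db.execute("SELECT id, time, event, link FROM events ORDER BY id").fetchall()
```

With these inputs, row 94 and the new row 96 share a link and row 95 stays unlinked, as in the table above.
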
The connected row can be up to 30 min off, but I think that's okay.

Do you think this is acceptable from a privacy perspective?
======
Chrissvo
That's a nice challenge!

If you're super distrustful you could argue that you should never store a
timestamp with a signup event, because it could potentially reveal a user's
identity...

Here's a crazy thought, what if you would do this:

1. You fire off a default first event, say "init". On the server you generate
a PGP key pair, store the private key with the init event, and return the
public key.

2. The second event (the first real event) is fired by the website owner and
encrypted with the PGP public key from step 1.

3. On the server you try to decrypt event #2 with all available active private
keys (stored with init events).

4. Once a match is found, you link the 2nd event to the 1st event, delete
the private key of the 1st event, generate a new PGP key pair, store the
private key with the 2nd event, and return the new public key.

5. The third event is encrypted with the public key from step 4, and so on.

No need to store timestamps, and all traffic is encrypted. Now, how to make
step 3 fast?

~~~
harianus
Thanks! I like the way you think.

2. I think PGP encryption is pretty heavy and maybe not great for the
performance of a script that loads on a lot of websites.

3. I'm not sure how fast this will be, especially on a very busy website with
lots of page views per second.

Basically, you could also store a variable with the event and send that
variable back. What would be the added value of using PGP encryption?
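
To make the "store a variable and send it back" alternative concrete, here is a minimal sketch with no PGP involved. The class `TokenChain` and its fields are hypothetical. Each stored event hands out a fresh random token, and the next event presents it, so the server finds the previous event with one dictionary lookup instead of trying every private key. Note that the token still acts as a short-lived identifier between two consecutive events.

```python
import secrets


class TokenChain:
    """Links each event to the previous one through a single-use token."""

    def __init__(self):
        self._open = {}   # token -> id of the latest event in that chain
        self.events = []  # (event_id, name, previous_event_id)

    def record(self, name, token=None):
        # pop() makes the token single-use: a replayed token links to nothing.
        previous = self._open.pop(token, None)
        event_id = len(self.events)
        self.events.append((event_id, name, previous))
        new_token = secrets.token_urlsafe(16)
        self._open[new_token] = event_id
        return new_token


chain = TokenChain()
t1 = chain.record("init")
t2 = chain.record("/pricing", t1)
t3 = chain.record("signup", t2)
chain.record("signup", t1)  # stale token: stored, but not linked
```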

------
harianus
I don't want to use a session cookie with an ID to link all events. I don't
want any ID, because I could potentially link those IDs together in the back
end based on IP (I don't, but I don't want people to have to trust me). I want
to make sure I never receive data that my system could misuse.

~~~
vokep
I think a good way to go may be a compromise. You're already making an effort
to protect privacy without requiring trust, and that's a good start. But maybe
you need some kind of ID to tie behavior together, so you record one
temporarily, until you've processed it into an aggregate (anonymized
individual behaviors).

Basically, train a machine learning model on the data of individuals. You
don't want to overfit, because that could be de-anonymizable, but a slightly
underfit model could capture most of the important patterns while throwing
out most of the identifying aspects.

The hard part then becomes finding a way to demonstrate that this is actually
happening, so that you can be trusted. Unfortunately I can't think of a
provable way, since you pretty much either can track users by IDs or you
can't. And if you can, then trust has to be assumed.

~~~
harianus
But with my solution in the main Ask HN I don't need any ID. So why should I
not do it that way?

------
harianus
And how would it look from a privacy perspective if I set a cookie for 90 days?
I can't link it to any personal information, and my customers will only see
my tool, where they can see the conversions (they don't get access to the
"link" column in the tables above).
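
For illustration, this is roughly what a customer-facing report could compute from the example table. It is a sketch under assumptions: the function name `conversion_report` is made up, times are stored as integer seconds, and the schema is the one shown in the post. The link value is used only inside the query and never appears in the result.

```python
import sqlite3


def conversion_report(db, site, referrer, goal):
    """Return aggregate numbers only; no row ids or link values leave here."""
    (visits,) = db.execute(
        "SELECT COUNT(*) FROM events WHERE site = ? AND referrer = ?",
        (site, referrer),
    ).fetchone()
    # Self-join: v is the landing row, g is the goal row sharing its link.
    conversions, avg_seconds = db.execute(
        "SELECT COUNT(*), AVG(g.time - v.time) "
        "FROM events v JOIN events g ON g.link = v.link "
        "WHERE v.site = ? AND v.referrer = ? AND g.event = ?",
        (site, referrer, goal),
    ).fetchone()
    return {"visits": visits, "conversions": conversions, "avg_seconds": avg_seconds}


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, time INTEGER, "
    "event TEXT, site TEXT, referrer TEXT, link TEXT)"
)
db.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?, ?, ?)",
    [
        (94, 1000, "/", "mysite.com", "ddg.com", "a"),
        (95, 1488, "/", "mysite.com", "ddg.com", None),
        (96, 1030, "signup", "mysite.com", None, "a"),
    ],
)
report = conversion_report(db, "mysite.com", "ddg.com", "signup")
print(report)  # {'visits': 2, 'conversions': 1, 'avg_seconds': 30.0}
```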

------
nartz
Differential privacy.
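
For readers unfamiliar with the term, a minimal sketch of the basic idea is adding Laplace noise to a count before reporting it. The function name `noisy_count` and the epsilon value are illustrative assumptions, and this floating-point sampler is for explanation only, not a hardened implementation.

```python
import random


def noisy_count(true_count, epsilon, rng=random):
    """Laplace mechanism for a count (sensitivity 1): noise scale is 1/epsilon."""
    # The difference of two exponential draws with rate epsilon is
    # Laplace-distributed with mean 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


# A reported conversion count would differ slightly on every release.
reported = noisy_count(100, epsilon=1.0)
```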

~~~
harianus
Not really, that is more for when you have sensitive data and want to publish
it. I want to have only non-sensitive data and make sure I don't
receive sensitive data from the visitors.

