
Privacy or Data, a Convenient False Dichotomy - __ka
https://www.0x65.dev/blog/2019-12-02/is-data-collection-evil.html
======
nocturnial
> Let us start with the need for data, going straight to the point: one cannot
> build a competitive search without data collected from people.

Google had a very good search engine before they started to collect data from
everyone, not after.

I don't have a metric to decide whether Google's search results have gotten
better or worse over the years. Subjectively, and for me personally, they've
gotten worse. And I'm not the only one who thinks this. I could be (and
probably am) wrong, but I think the more personal data a search engine has,
the worse the results.

When I think about it that way, it makes sense why it has gotten worse. The
search engine pays more attention to the data it thinks it has collected from
me and how it relates to advertisers, etc., instead of paying attention to the
actual words I wrote in the query.

~~~
missosoup
Google search results have definitely gotten worse over time. I can think of a
few reasons.

* Search is now optimising to direct users to where Google wants users to go rather than where users want to go (i.e. to sell shit)

* Search is now heavily censored to be 'mainstream'

* Google disproportionately biases search towards ideologies and political leanings that Google prefers (remember that time they arbitrarily decided to remove all results for guns and inadvertently also made it impossible to search for water guns?)

Switching to foreign search engines like Baidu and Yandex really opened my
eyes to how heavily Google manipulates what I get to see.

Google search in the 2000s was significantly more useful than it is today, but
most of it has nothing to do with collecting personal data and more to do with
Google transitioning from a search company to a pseudo internet hegemony.

A better example of personal data collection leading to worse results is
Youtube. Youtube puts people into weird tiny bubbles of content that the
algorithm thinks will get viewing time from that particular individual, and
prevents that individual from easily discovering vast swaths of other content
they may find interesting. Anyone else have a Youtube home page full of videos
from 3-10 years ago that they've already seen?

Google is starting to feel like the paperclip optimiser. I'm sure the way its
products work is currently optimal for whatever metrics they set, but it's
certainly not optimal for me as a user. Google makes me feel exploited.

~~~
summerlight
Or Google simply could not keep up with the pace of web growth. The web is
already 100x bigger than it was 10 years ago, and the external environment is
actively gaming Google (SEO, spam, etc.), while the information available from
a user query is still effectively the same. Personalization may add more
implicit information to a query (e.g. if you're tech-savvy, then the keyword
"Rust" more likely means the programming language than the game), but most of
the time it lacks the actual context, so its usefulness is somewhat bounded.
Most of the useful information still comes from the query itself. Hopefully,
advances in the field of NLP will allow longer queries to be more effective,
but I don't see any straightforward way to maintain search quality with a one-
to three-word query.

------
amelius
The question is: why do companies even have to see our data? In the old days
of the internet, universities and government institutions dealt with data; the
companies were only there to provide the hardware.

~~~
__ka
I understand why universities would get the benefit of the doubt, but I am not
sure having search engines be a government service would be a very good idea.

~~~
amelius
Yeah, many problems to solve here.

But let's take another example: Google going into the space of deep-learning-
based health diagnostics. Why does Google need to have access to our medical
data when they can sell their solutions as hardware and software to hospitals?
Even if they have to use our data for testing purposes, they could sign an NDA
so the use of our data would be severely restricted and they would not _own_
our data.

And for many other services, _owning_ our data should not be the default, and
even the necessity of _seeing_ our data should be questioned.

~~~
__ka
I agree. That data should stay with its rightful owner: the data subject.

I believe that we have fundamental issues with personal data ownership in the
web for two reasons:

1. People do not believe the web is real life. In the physical world, it is
very easy to see how your rights are violated. If a person follows you for
days (when you shop, when you buy your medicine, when you talk to friends),
you call the police. The majority of users have no idea third-party trackers
exist. They have no idea what (or rather, how much) information they are
emitting at each point in time.

2. People have access to amazing products for "free", and they do not know
the price of their data. Imagine if you were given a TV, but you had to
babysit a boring kid 2 hours a day for years. I guess not many would want that
TV. Force each company (legally) to offer two plans: one free + (ads /
tracking), one premium (no ads / no tracking), then see how much people care
about their data.

~~~
autoexec
> Force each company to offer two plans (legally): one free + (ads /
> tracking), one premium (no ads / no tracking), then see how much people care
> about their data.

I like this idea but only if you force companies to price the non-free version
no higher than the actual monetary value of the data they collect. Otherwise
you'll have Gmail's "We'll stop reading your private emails" plan priced at
100,000 a month. I don't want to create a situation where only the ultra
wealthy can protect their right to privacy.

------
summerlight
While the approach mentioned in this article seems interesting, there are
multiple other approaches to mitigating privacy issues in data collection:
cryptographic approaches like homomorphic encryption, or decentralized
approaches like federated learning. I still wonder how this approach is
differentiated from the others.

~~~
solso
[Disclaimer: I work at Cliqz]

Yes, ours is not the only approach. Even when we started collecting data back
in 2014 it wasn't the only one, but we did not find any suitable off-the-shelf
solution back then.

Homomorphic encryption was discarded because some of the data to be sent needs
to be in the clear. For instance, a URL we need to fetch cannot be
transformed. It is also computationally very expensive.

Federated learning is actually closer to what we do: think of our approach as
federated learning where each node is a single user and where all aggregation
of records needs to happen there, so that record linkage on the final
collector is impossible.
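To make "breaking record linkage" concrete, here is a toy Python sketch of my own (not Cliqz's actual protocol or message format): instead of one payload tied together by a persistent user ID, each record leaves the client as an independent message with a fresh random ID and no shared identifier the collector could use to join records back into a profile.

```python
import uuid

# A linked payload: one persistent user ID ties all records together,
# so the collector could join them back into a behavioural profile.
linked = {
    "user_id": "u-12345",
    "records": [
        {"query": "rust async", "clicked": "docs.rs"},
        {"query": "flu symptoms", "clicked": "example-health-site.org"},
    ],
}

def break_linkage(payload):
    """Emit each record as an independent message: a fresh random ID
    per message and no shared identifier, so the collector cannot
    link one record to another."""
    return [
        {"msg_id": str(uuid.uuid4()), **record}
        for record in payload["records"]
    ]

messages = break_linkage(linked)
assert all("user_id" not in m for m in messages)
```

The real system would also have to worry about implicit linkage (timing, IP addresses, quasi-identifying content inside each record), which is what the follow-up posts address.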

Tomorrow and the day after we are releasing the technical details in separate
blog posts; this one was more of a motivation/introduction to the main dish.

~~~
strbean
How about differential privacy?

~~~
philippclassen
(Disclaimer: I work at Cliqz)

Just saw this one. It is an old comment, but let me try to answer as I find
the question interesting.

The post on Human Web ([https://0x65.dev/blog/2019-12-03/human-web-collecting-
data-i...](https://0x65.dev/blog/2019-12-03/human-web-collecting-data-in-a-
socially-responsible-manner.html)) has a brief section regarding differential
privacy. Maybe check that one out first.

My take on it: although we do see value in differential privacy, we do not
believe it fits well in our particular case. The critical moment is deciding
what data should be sent by the client. Once data is out, it is out; it is not
possible to apply anonymization once it is on the server. If someone knows how
that can be done safely, I would be highly interested.

We consider our chosen approach - breaking record linkage before sending -
safer and simpler for our use case. Do not underestimate the simplicity
argument. Differential privacy is a powerful technique, but it is also very
complex; there are lots of pitfalls, and it is crucial to make good choices
for the parameters.
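For contrast, a minimal sketch of the kind of mechanism being discussed (a textbook Laplace mechanism for a count query, my own illustration, not anything Cliqz ships): the noise scale is sensitivity/epsilon, so the whole privacy/utility trade-off hinges on choosing epsilon well, which is exactly the parameter pitfall mentioned above.

```python
import math
import random

def dp_count(true_count, epsilon, sensitivity=1.0):
    """Laplace mechanism for an epsilon-DP count query.

    Noise scale is sensitivity/epsilon: smaller epsilon means stronger
    privacy but noisier answers, which is why parameter choice is such
    a critical (and error-prone) decision."""
    u = random.random() - 0.5
    scale = sensitivity / epsilon
    # Inverse-CDF sampling of Laplace(0, scale)
    noise = -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

random.seed(0)
print(dp_count(100, epsilon=1.0))
```

Note that this protects an aggregate answer on a server that already holds the raw data; the comment's point is that their approach avoids the raw data ever reaching the server in linkable form.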

Would be a good topic for another blog post. ;-)

------
gfawke5
Seems to be day 2 of the advent.

From day 1: The world needs more search engines [1].

[1] [https://www.0x65.dev/blog/2019-12-01/the-world-needs-
cliqz-t...](https://www.0x65.dev/blog/2019-12-01/the-world-needs-cliqz-the-
world-needs-more-search-engines.html)

~~~
pythux
Indeed, the idea is to use each day until Christmas to unveil a new post about
search, privacy and security. Today is about opening a discussion on data
collection practices. It's often presented as "all or nothing" (more often
than not, it's "nothing"). But data is needed to create independent and viable
alternatives to the giants, so maybe there is a middle ground if it's done
with privacy at heart from the ground up.

Edit: For those curious about the details, we've started a new page where we
gather research, talks and blog posts from the past about these topics:
[https://0x65.dev/pages/dissemination-
cliqz.html](https://0x65.dev/pages/dissemination-cliqz.html)

------
soumyadeb
There should be a way to store data so that aggregate queries like "count how
many different people are in a given location" can be answered without
compromising individual privacy. Approaches like k-anonymity or L-diversity
come to mind.
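For readers unfamiliar with k-anonymity, a minimal illustrative check (my own sketch, with made-up rows): a table is k-anonymous with respect to a set of quasi-identifiers if every combination of those values is shared by at least k rows.

```python
from collections import Counter

def is_k_anonymous(rows, quasi_identifiers, k):
    """True if every combination of quasi-identifier values appears in
    at least k rows, so no individual can be singled out by those
    attributes alone."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())

rows = [
    {"zip": "941**", "age": "20-29", "condition": "flu"},
    {"zip": "941**", "age": "20-29", "condition": "cold"},
    {"zip": "105**", "age": "30-39", "condition": "flu"},
]
# The third row is unique on (zip, age), so the table is not 2-anonymous.
assert is_k_anonymous(rows, ["zip", "age"], 2) is False
```

L-diversity adds the further requirement that each such group also contains at least L distinct values of the sensitive attribute (here, "condition").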

The problem is that there is no incentive for businesses to do so. Even
regulations like GDPR & CCPA mostly don't touch upon the privacy of first-
party data. As long as you give an option to delete the data (which often is
hard to implement), you don't have to do anything more.

~~~
kkm
Hi soumyadeb,

It is true that there needs to be a way to store the aggregates "per user". In
the current approach to collection, that place happens to be the server, but
per-user aggregation can easily be done on the client side by leveraging
browser storage.

Approaches like k-anonymity or L-diversity are good, but they tackle the
problem from a different perspective: making sensitive data available for
querying without revealing the actual content. The approach suggested in this
article describes a methodology that removes the need to collect such data in
the first place.
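To make the client-side aggregation idea concrete, a hypothetical sketch (names and thresholds are mine, not Human Web's actual implementation): raw events accumulate in client-local storage, and only a coarse, already-aggregated signal ever leaves the client.

```python
class ClientSideCounter:
    """Raw events accumulate in client-local storage (a dict stands in
    for browser storage here); only a coarse boolean signal is ever
    reported, never the visit log itself."""

    def __init__(self):
        self._visits = {}  # site -> visit count, never leaves the client

    def record_visit(self, site):
        self._visits[site] = self._visits.get(site, 0) + 1

    def report(self, site, threshold=5):
        # Report only "popular with this user: yes" once a threshold is
        # crossed; no counts, timestamps or history are included.
        if self._visits.get(site, 0) >= threshold:
            return {"site": site, "popular": True}
        return None
```

The server then only ever sees coarse signals that many independent clients computed locally, rather than event streams it has to aggregate (and could link) itself.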

You can also check our paper: [https://www.0x65.dev/pages/dissemination-
cliqz.html#GreenAna...](https://www.0x65.dev/pages/dissemination-
cliqz.html#GreenAnalytics)

We will talk in detail about this methodology, Human Web, and how we use it to
collect data for our search without compromising users' privacy, based on
client-side aggregation.

Disclaimer: I work with Cliqz.

~~~
soumyadeb
Thanks for the reference. I quickly browsed through the paper. The problem you
mention seems to arise from the fact that GA is able to tie the user across
different domains (about.me, depressionforum.org, etc.) using a shared cookie.
Is it still an issue if third-party cookies are blocked and GA is forced to
set a first-party cookie? In that case, the ID GA gets for about.me would be
different from the one for depressionforum.org. Assuming the owners of these
websites are different and only care about their respective stats
(depressionforum doesn't care about about.me's analytics), why do we need
third-party cookies?

Am I missing something here?

Well, it can probably fingerprint the browser but there is no reason to store
that information.

~~~
soumyadeb
OK, got it. You guys @ Cliqz want aggregate stats across websites, so GA is
not the right example.

Depending on the aggregates you want, client-side aggregation may still leak
privacy. You would probably need to implement differential privacy on top
before you store the data.

~~~
solso
Of course client-side aggregation can still leak privacy; it needs to
guarantee that there are no explicit or implicit elements that would allow
record linkage on the server side. On differential privacy: note that you
mention anonymizing before you store the data, but the problem is not only
there; we want to prevent the data from being sent at all. Differential
privacy on the client side is tricky if distributions are unknown. That's why
we go for a "simpler" approach: all records sent by any user must be
unlinkable, always. If aggregation is needed to satisfy the use case, it can
only be done on the client itself. Re-identification only becomes possible if
a mistake is made; no a priori distributions are needed. IMHO this approach is
easier than differential privacy at the cost of being less expressive:
differentially private data allows for multiple use cases, whereas we require
all records to be independent from one another.

I'm sure it's obvious that I work at Cliqz and on this very topic :-) Tomorrow
there is a more technical article about our data collection, hope you like it.

Also, I would like to add that any methodology applied to protect the user's
data is welcome; it does not have to be ours at all. There is one caveat
though: the privacy protection has to happen at the origin. A solution that
sends data which then has to be anonymized is no good in our book, because
there is no guarantee that the raw data is removed.

For our use case, we believe ours was the easiest way to get the data we
needed while respecting the privacy of the user.

