Privacy or Data, a Convenient False Dichotomy (0x65.dev)
76 points by __ka 3 days ago | 25 comments

> Let us start with the need for data, going straight to the point: one cannot build a competitive search without data collected from people.

Google had a very good search engine before they started to collect data from everyone, not after.

I don't have a metric to decide whether the search results from google got better or worse over the years. Subjectively and for me personally, they've gotten worse. And I'm not the only one who thinks this. I could be (and probably am) wrong, but I think the more personal data a search engine has, the worse the results.

When I think about it that way, it makes sense why it has gotten worse. The search engine pays more attention to the data it thinks it has collected from me and how it relates to advertisers, etc... instead of paying attention to the actual words I wrote in the query.

Google search results have definitely gotten worse over time. I can think of a few reasons.

* Search is now optimising to direct users to where Google wants users to go rather than where users want to go (i.e. to sell shit)

* Search is now heavily censored to be 'mainstream'

* Google disproportionately biases search towards ideologies and political leanings that Google prefers (remember that time they arbitrarily decided to remove all results for guns and inadvertently also made it impossible to search for water guns?)

Switching up to foreign search engines like baidu and yandex really opened my eyes to how heavily Google manipulates what I get to see.

Google search in the 2000s was significantly more useful than it is today, but most of it has nothing to do with collecting personal data and more to do with Google transitioning from a search company to a pseudo internet hegemony.

A better example of personal data collection leading to worse results is Youtube. Youtube puts people into weird tiny bubbles of content that the algorithm thinks will get viewing time from that particular individual, and prevents that individual from easily discovering vast swaths of other content they may find interesting. Anyone else have a Youtube home page full of videos from 3-10 years ago that they've already seen?

Google is starting to feel like the paperclip optimiser. I'm sure the way its products work is currently optimal for whatever metrics they set, but it's certainly not optimal for me as a user. Google makes me feel exploited.

Or Google simply could not keep up with the pace of web growth. The web is already 100x bigger than it was 10 years ago and the external environment is actively gaming Google (SEO, spam, etc...) while the information available from a user query is still effectively the same. Personalization may add more implicit information to a query (e.g. if you're tech-savvy, then the keyword "Rust" more likely means the PL than the game), but most of the time it lacks the actual context, so its usefulness is somewhat bounded. Most of the useful information still comes from the query itself. Hopefully, advances in the field of NLP may allow a longer query to be more effective, but I don't see any straightforward way to keep the search quality with a 1~3 word query.

> Google is starting to feel like the paperclip optimiser. I'm sure the way its products work is currently optimal for whatever metrics they set, but it's certainly not optimal for me as a user

This makes me think of Goodhart's Law (https://en.wikipedia.org/wiki/Goodhart%27s_law)

> "When a measure becomes a target, it ceases to be a good measure."

[Disclaimer, work at Cliqz]

Comparisons of quality are relative. Google was good from day one because it was better than the competitors at that moment in time. The problem is that if one starts using X, and that provides an edge, the others have to follow just to keep up. A very unfortunate rabbit hole.

I'm with you 100% that personal data makes search worse in many cases, too much personalization is detrimental.

One last thing, which I'd like to stress, is that data from users is not necessarily the same as personal data. For instance, you can use the data from users as if they were just sensors, without trying to build sessions out of it.

In a way it's similar to what Google did when they started. Google was the first and only one to use anchor text, which is user generated and a less noisy description of the content of the page. (True, they did not use the users themselves, but they did the proper automatic crawling.) But in a way they collected sensor data from users, not personal data. That's what we should still be doing today. The problem, however, is that it is too easy to go beyond "sensor data" and start to collect full sessions and even personal data.

> I'm with you 100% that personal data makes search worse in many cases, too much personalization is detrimental.

This isn't even my main objection.

Shouldn't data collection be reciprocal? I submit data you can use and you submit the data you extracted and used from me?

The ideal would be when I submit a search query to your server and it would respond "I have collected (this) data from you" "I have put for auction data.{xyz} on market Y" "The companies bidding were a,b,c and they offered amount g,h,i" "Company U won the bid, that's why you're seeing this ad".

If you think about it, there's no way someone would be transparent enough to give this data to the users.

If "you" aren't willing to share your "sensor" data, why should we be forced to share ours?

Probably lots of people remember the good old days of Google Advanced Search. Or the ability to fine tune the results with special operators like +, -, site:, etc. It was probably 2010 when we started to talk about filter bubbles and how "smart" algorithms adjust the data to our liking. No one even thought about disasters like Cambridge Analytica. Although those days are long gone, there is a perspective that is not purely negative. IMAO modern search engines are more approachable. They are easier than ever to use, specifically for people who don't necessarily know how to use them or don't have the patience to learn. We are power users and often forget that most people don't have to be. They just want to search.

Therefore the approach that Cliqz takes to its data collection makes even more sense. What Human Web does is extract common knowledge (or sense) about what the Web means to people. Not to you, not to me, but to all of us.

The question is why do companies even have to see our data? In the old days of the internet universities and government institutions dealt with data, the companies were only there to provide the hardware.

I understand why universities would get the benefit of the doubt, but I am not sure having search engines be a government service would be a very good idea.

Yeah, many problems to solve here.

But let's take another example: Google going into the space of deep learning based health diagnostics. Why does Google need to have access to our medical data when they can sell their solutions as hardware and software to hospitals? Even if they have to use our data for testing purposes, they could sign an NDA so the use of our data would be severely restricted and they would not own our data.

And for many other services, owning our data should not be the default, and even the necessity of seeing our data should be questioned.

I agree. That data should stay with its rightful owner: the data subject.

I believe that we have fundamental issues with personal data ownership in the web for two reasons:

1. People do not believe the web is real life. In the physical world, it is very easy to see how your rights are violated. If there's a person following you for days (when you shop, when you buy your medicine, when you talk to friends), you call the police. The majority of users have no idea about third party trackers. They have no idea what (or rather how much) information they are emitting at each point in time.

2. People have access to amazing products for "free", and they do not know the price of their data. Imagine if you were given a TV, but you would have to babysit a boring kid 2 hours a day for years. I guess not many would want that TV. Force each company to offer two plans (legally): one free + (ads / tracking), one premium (no ads / no tracking), then see how much people care about their data.

> Force each company to offer two plans (legally): one free + (ads / tracking), one premium (no ads / no tracking), then see how much people care about their data.

I like this idea but only if you force companies to price the non-free version no higher than the actual monetary value of the data they collect. Otherwise you'll have Gmail's "We'll stop reading your private emails" plan priced at 100,000 a month. I don't want to create a situation where only the ultra wealthy can protect their right to privacy.

Is there a meaningful distinction between a government that runs a search engine and a government that compels search engines to remove results it doesn't like and forces them to hand over every scrap of data it collects?

While the approach mentioned in this article seems interesting, there are multiple other approaches to mitigate privacy issues in data collection: cryptographic approaches like homomorphic encryption, or decentralized approaches like federated learning. I still wonder what differentiates this approach from the others.

[Disclaimer: I work at Cliqz]

Yes, we are not the only approach. Even when we started collecting data back in 2014 it wasn't the only one. But we did not find any suitable off-the-shelf solution back then.

Homomorphic encryption was discarded because some of the data to be sent needs to be in the clear. For instance, a URL we need to fetch, so we cannot transform it. Also, it is computationally very expensive.

Federated learning is actually closer to what we do. Take our approach as federated learning where each node is a single user and where all aggregation of records needs to happen there, so that record-linkage on the final collector is impossible.
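To make the "aggregate on the node, send only unlinkable records" idea concrete, here is a minimal sketch. It is not Cliqz's implementation: the function names and the message shape are made up for illustration, and it ignores transport-level identifiers such as IP addresses, which a real system would also have to deal with.

```python
from collections import Counter

def aggregate_on_client(visits):
    """Collapse one user's (query, url) visits locally (hypothetical helper)."""
    return Counter(visits)

def build_messages(local_counts):
    """Emit one independent message per distinct pair, with no user or session id."""
    return [{"query": q, "url": u} for (q, u) in sorted(local_counts)]

def collect(all_messages):
    """Server side: sees only a bag of messages and counts how many clients sent each."""
    return Counter((m["query"], m["url"]) for m in all_messages)

# Two users; Alice visits the same page twice, which is deduplicated on her machine.
alice = [("rust book", "https://doc.rust-lang.org/book/"),
         ("rust book", "https://doc.rust-lang.org/book/")]
bob = [("rust book", "https://doc.rust-lang.org/book/"),
       ("hn", "https://news.ycombinator.com/")]

messages = (build_messages(aggregate_on_client(alice))
            + build_messages(aggregate_on_client(bob)))
server_counts = collect(messages)
```

The collector can tell that two clients reached the Rust book from the query "rust book", but nothing in a message says which other messages came from the same person.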

Tomorrow and the day after tomorrow we are releasing the technical details on different blog-posts, this one was more a motivation/introduction to the main dish.

Thanks a lot for the clarification. I'm looking forward to reading the upcoming articles. We really need more cases and studies on privacy preserving data collection practices. Hope your efforts will help stimulate this trend!

How about differential privacy?

Differential privacy is more about preserving privacy from consumers of data (e.g. a user running a query) -- it doesn't have guarantees around what sort of data is collected (e.g. google could expose a differentially private API, but internally they would still have all of your data).

Edit: after considering it more, I realize you can certainly apply some mechanisms from differential privacy to play a part in the data collection schemes.
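One such mechanism that runs on the client is randomized response: each user flips their answer with a known probability before sending it, and the collector corrects for the noise in aggregate. A rough sketch, assuming a single yes/no attribute per user; the function names and the epsilon value are illustrative and not taken from the article.

```python
import math
import random

def randomized_response(truth: bool, epsilon: float, rng: random.Random) -> bool:
    """Report the true bit with probability e^eps / (1 + e^eps), else the flipped bit."""
    p_truth = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return truth if rng.random() < p_truth else not truth

def estimate_fraction(reports, epsilon: float) -> float:
    """Unbiased estimate of the true fraction of True answers from the noisy reports."""
    p = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = f * p + (1 - f) * (1 - p), solved for f.
    return (observed - (1.0 - p)) / (2.0 * p - 1.0)

rng = random.Random(0)                       # fixed seed so the sketch is repeatable
truths = [i < 3000 for i in range(10000)]    # 30% of users have the attribute
reports = [randomized_response(t, 1.0, rng) for t in truths]
estimate = estimate_fraction(reports, 1.0)
```

No single report can be trusted, which is the point: the collector only learns the population-level fraction, and a smaller epsilon means more noise per user.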

seems to be like day 2 of the advent.

from day 1: The world needs more search engines [1].

[1] https://www.0x65.dev/blog/2019-12-01/the-world-needs-cliqz-t...

Indeed, the idea is to use each day until Christmas to unveil a new post about search, privacy and security. Today is about opening a discussion about data collection practices. It's often presented as an "all or nothing" choice (more often than not it's "nothing"). But data is needed to create independent and viable alternatives to the giants, so maybe there is a middle ground if it's done with privacy at heart from the ground up.

Edit: For those curious about the details, we've started a new page where we gather research, talks and blog posts from the past about these topics: https://0x65.dev/pages/dissemination-cliqz.html

There should be a way to store data so that aggregate queries like "count how many different people are in a given location" can be answered without compromising individual privacy. Approaches like k-anonymity or L-diversity come to mind
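As a rough illustration of that kind of query, the sketch below counts distinct people per location and suppresses any location with fewer than k people. This is only small-count suppression in the spirit of k-anonymity, not a full k-anonymity or L-diversity scheme, and the function name, record shape and threshold are assumptions.

```python
from collections import defaultdict

def people_per_location(records, k=5):
    """Count distinct people per location, dropping locations with fewer than k people."""
    people = defaultdict(set)
    for person_id, location in records:
        people[location].add(person_id)
    return {loc: len(ids) for loc, ids in people.items() if len(ids) >= k}

records = [
    ("p1", "station"), ("p2", "station"), ("p3", "station"),
    ("p1", "station"),              # repeat visit, still one distinct person
    ("p4", "cafe"),                 # a single person: too identifying to release
]
released = people_per_location(records, k=3)
```

Note that the raw records still sit on whoever runs this query, so it protects the people reading the results, not the people whose data was collected.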

The problem is that there is no incentive for businesses to do so. Even regulations like GDPR & CCPA mostly don't touch upon the privacy of first-party data. As long as you give an option to delete the data (which often is hard to implement), you don't have to do anything more.

Hi soumyadeb,

It is true that there needs to be a place to store the aggregate queries "per user". In the current approach to collection this place happens to be the server, but per-user aggregation can easily be done on the client side by leveraging browser storage.

Approaches like k-anonymity or L-diversity are good but they tackle the problem from a different perspective - making sensitive data available for querying without revealing actual content. The approach suggested in this article talks about methodology which removes the need to collect such data in the first place.

You can also check our paper: https://www.0x65.dev/pages/dissemination-cliqz.html#GreenAna...

We will talk in detail about this methodology, Human Web, and how we use it to collect data for our search without compromising users' privacy, based on client-side aggregation.

Disclaimer: I work with Cliqz.

Thanks for the reference. I quickly browsed through the paper. The problem you mention seems to arise from the fact that GA is able to tie the user across different domains (about.me, depressionforum.org etc) using a shared cookie etc. Is it still an issue if 3rd party cookies are blocked and GA is forced to set a first-party cookie? In that case, would the ID GA gets for about.me be different from the one for depressionforum.org? Assuming the owners of these websites are different and only care about their respective stats (depressionforum doesn't care about about.me analytics), why do we need 3rd party cookies?

Am I missing something here?
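For what it's worth, the difference between the two cookie setups can be sketched with a toy model. The `Browser` class and its method are hypothetical, not a real browser or GA API, and the model ignores fingerprinting entirely.

```python
import uuid

class Browser:
    """Toy model of how a tracker's cookie ID is scoped. Not a real browser API."""

    def __init__(self, third_party_cookies_allowed: bool):
        self.third_party_cookies_allowed = third_party_cookies_allowed
        self.shared_id = None      # one cookie on the tracker's own domain
        self.per_site_ids = {}     # one first-party cookie per visited site

    def tracker_id_for(self, site: str) -> str:
        """Return the ID the analytics script would see on `site`."""
        if self.third_party_cookies_allowed:
            if self.shared_id is None:
                self.shared_id = uuid.uuid4().hex
            return self.shared_id
        if site not in self.per_site_ids:
            self.per_site_ids[site] = uuid.uuid4().hex
        return self.per_site_ids[site]

open_browser = Browser(third_party_cookies_allowed=True)
strict_browser = Browser(third_party_cookies_allowed=False)

# With a shared cookie the same ID shows up on both sites and the visits can be joined.
linked = (open_browser.tracker_id_for("about.me")
          == open_browser.tracker_id_for("depressionforum.org"))
# With first-party cookies only, each site gets its own unrelated ID.
unlinked = (strict_browser.tracker_id_for("about.me")
            != strict_browser.tracker_id_for("depressionforum.org"))
```

Each site still gets a stable ID for its own stats, which is all a single site owner needs.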

Well, it can probably fingerprint the browser but there is no reason to store that information.

OK got it. You guys @ Cliqz want aggregate stats across websites so GA is not the right example.

Depending on the aggregates you want, client-side aggregation may still be leaking privacy. You would probably need to implement differential privacy on top before you store the data.

Of course client-side aggregation can still leak privacy; it needs to guarantee that there are no explicit or implicit elements that would allow record-linkage on the server side. On diff. privacy, note that you mention applying it before you store the data. The problem is not only there: we want to prevent the data from being sent at all. Diff. privacy on the client side is tricky if distributions are unknown. That's why we go for a "simpler" approach: all records sent by any user should be unlinkable, always. If aggregation is needed to satisfy the use-case, it can only be done on the client itself. Re-identifiability only becomes possible if a mistake is made, and no a priori distributions are needed. IMHO this approach is easier than diff. privacy at the cost of being less expressive. Diff. privacy allows for multiple use-cases that ours does not, since all our records have to be independent from one another.

I'm sure it's obvious that I work at Cliqz and on this very topic :-) Tomorrow there is a more technical article about our data collection, hope you like it.

Also, I would like to add that any methodology applied to protect the data of the user is welcome; it does not have to be ours at all. There is one caveat though: the privacy protection has to be at the origin. A solution that sends data that then has to be anonymized is no good in our book, because there is no guarantee that the raw data is removed.

For our use-case we believe it was the easier way to get the data we needed while respecting the privacy of the user.
