
Personalized news recommendations and privacy are not mutually exclusive - PaxX
https://www.0x65.dev/blog/2019-12-16/your-news-is-not-our-business.html
======
Thorentis
I think this "locally filtered, personalised content" idea has a lot of
potential, and I think that domain name filtering is just the very tip of what
can be done. Using browsing history is a great idea, but I think there needs
to be better information to filter on rather than just domain.

Consider tags. Imagine if in your history, all the tags associated with an
article were saved too. Tags aren't provided by the website? Websites should
be encouraged to provide better tags for personalisation (tags over tracking
beacons!), or the browser/browser plugin uses some kind of text processing
(machine learning?) technique to generate tags automatically from content, and
can train on the millions of already tagged articles in websites around the
place.

The trivial solution is then: increment the count for each of the tags when a
user visits a page with those tags. In top news, show the stuff at the top
with the highest tag count. When the user searches for something ambiguous
that matches some of the tags, show results for the most-visited tags first.
E.g. if you search for "election results" and the user has a higher frequency
of "us politics" tag visits compared to "uk politics" tag visits, show US
election results higher up. (You could weight this based on current events
too, e.g. which one is more recent and so on). Over time, the plugin itself
could have a model which could train its filtering, and maybe receive global
anonymous updates that improve the algorithm. All data is local, the only
thing you need is a list of tags for each article.
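
A minimal sketch of the tag-counting idea above, in Python. The class name `TagProfile`, its methods, and the example tags are hypothetical, and it assumes every article already comes with a list of tags. An article's score here is simply the sum of the visit counts of its tags, which is one possible reading of "highest tag count".

```python
from collections import Counter


class TagProfile:
    """Local-only interest profile: one counter per tag (hypothetical helper)."""

    def __init__(self):
        self.counts = Counter()

    def record_visit(self, tags):
        # Increment the count of every tag attached to the visited page.
        self.counts.update(set(tags))

    def score(self, tags):
        # Assumption: an article's score is the sum of its tags' counts.
        return sum(self.counts[t] for t in set(tags))

    def rank(self, articles):
        # articles: list of (title, tags) pairs; highest score first.
        return sorted(articles, key=lambda a: self.score(a[1]), reverse=True)


profile = TagProfile()
for _ in range(3):
    profile.record_visit(["us politics", "elections"])
profile.record_visit(["uk politics", "elections"])

results = [
    ("UK election results", ["uk politics", "elections"]),
    ("US election results", ["us politics", "elections"]),
]
ranked = profile.rank(results)
```

With three "us politics" visits against one "uk politics" visit, the US article ranks first for the ambiguous "election results" query. Weighting by recency or current events would be an extra term in `score`.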

~~~
ctas
> The trivial solution is then: increment the count for each of the tags when
> a user visits a page with those tags. In top news, show the stuff at the top
> with the highest tag count.

Just last week I started building a feed reader which does exactly that.

Feeds are more like entire newspapers, but maybe you're only interested in the
tech articles. You can build up a weighted distribution of keywords over time
which describe the current interests of the user, using the approach you
described. And all you need to do to make it work is read as usual.

If you want to try it out: [https://www.feedist.io](https://www.feedist.io)
It's still in beta but will very likely leave it next week.
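
For readers wondering what a "weighted distribution of keywords over time" could look like, here is a rough sketch. It is not Feedist's actual implementation; the function names and the exponential `decay` factor are assumptions, chosen so that older reading habits fade and the weights track current interests.

```python
from collections import defaultdict


def update_interests(weights, keywords, decay=0.9):
    """Fade old interests, then add weight for the article just read."""
    updated = defaultdict(float)
    for word, weight in weights.items():
        updated[word] = weight * decay  # assumed exponential decay
    for word in set(keywords):
        updated[word] += 1.0
    return dict(updated)


def as_distribution(weights):
    """Normalise raw weights so they sum to 1."""
    total = sum(weights.values())
    return {w: v / total for w, v in weights.items()} if total else {}


interests = {}
for article_keywords in [["rust", "compilers"], ["rust", "wasm"], ["football"]]:
    interests = update_interests(interests, article_keywords)

dist = as_distribution(interests)
```

After these three reads, "rust" carries the most weight because it appeared twice, while the single most recent keyword ("football") outweighs the older single mentions.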

------
user729483
> Note that the only information sent is domains—not interests, not profiles,
> just general-purpose news domains like bbc.co.uk.

Alex requests The Washington Post and CNN.

Bob requests The Wall Street Journal and Fox News.

And Joe requests PornHub.

I know the last one is not in the list but my point is: the domain is more
than enough to profile you for profitable reasons.

Plus, the fact that this isn't addressed or even mentioned in the article is
not good.

Will we ever get a browser that does not send anything to the
company/organization developing it for any reason/excuse?

------
danso
I was going to ask why this MVP wasn't its own plugin instead of a browser,
but Cliqz is obviously more than just news recommendation, and also includes
an initiative to build a separate search engine/index among other things.
Their about page, for anyone who is as unfamiliar with Cliqz as I am:
[https://cliqz.com/en/about](https://cliqz.com/en/about)

------
anotheryou
I wonder though if a recommendation engine is good after all.

I wrote a concept with (possibly local) curators to subscribe to and the
ability to up- or downvote them in your news stream. This way everything stays
on a 1-to-1 basis.

There is no AI judging your clicks (quantitatively I click more cat pictures
than 7-page articles, even if I prefer the latter).

And no grouping/democracy which I think does not work well for taste-stuff
(everybody can agree on cute cats and your bubble agrees on political opinions
that are obvious to you, but thought provoking, hard to digest stuff is out
again).

If anyone is interested I can send you my write up (even has a few pictures in
it!).

------
Spivak
I’m concerned with how the “noise” is implemented, since it's not that hard to
pick out the fixed set of real preferences given enough samples of the noisy
data.

As described, this seems like it could be defeated by a naive top(3,
count(sites)). I guess it helps, but there’s probably more ambient noise
from NAT and changing addresses than is being added by the random options.
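
To make the top(3, count(sites)) concern concrete, here is a small simulation sketch. Everything in it is an assumption for illustration: the shortlist of domains is made up, each message is assumed to carry the user's fixed real domains plus a few uniformly random decoys, and the observer is assumed to be able to link messages from the same user (which the scheme tries to prevent).

```python
import random
from collections import Counter

# Hypothetical shortlist of news domains and one user's fixed real preferences.
SHORTLIST = [f"news{i}.example" for i in range(50)]
REAL = {"news3.example", "news17.example", "news42.example"}


def noisy_message(rng, noise=3):
    # Assumed scheme: every message carries the real domains plus
    # `noise` decoys drawn uniformly from the rest of the shortlist.
    decoys = rng.sample([d for d in SHORTLIST if d not in REAL], noise)
    return list(REAL) + decoys


def top_k(messages, k=3):
    # The naive attack: count every domain and keep the k most frequent.
    counts = Counter(d for msg in messages for d in msg)
    return {d for d, _ in counts.most_common(k)}


rng = random.Random(0)
messages = [noisy_message(rng) for _ in range(200)]
recovered = top_k(messages)
```

Under these assumptions the real domains appear in every message while each decoy appears only occasionally, so counting recovers the real set. The attack only works if messages can be linked to one user in the first place.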

~~~
kkm
Thank you for taking the time to read the article and sharing your feedback.

Noise is added for what is called plausible deniability. Note that messages
themselves do not contain any user-identifier. We take extra measures to strip
request headers not needed by the server to avoid extra information that could
be used for implicitly linking messages[1]. The possibility of linking messages
based on network fingerprinting (e.g. IP, which we do not log) still exists
and is an open concern which we will solve in the next version. At most, this
makes it possible for us to learn that these 3 domains are visited by the same
person. Again, given that the list of domains comes from shortlisted top-news
domains, it is safe to assume that they do not contain any PII.

That said, it is not the strongest model we apply. Due to resource
constraints we have not yet moved to the strongest models, as we do for more
sensitive data via HumanWeb[2]. We will soon make changes along a couple of
dimensions: a) send each domain as a separate message, since sending them
together right now introduces unwanted spatial correlations, and b) send the
domain through the proxy network HPN[3].

Ref:

[1]: [https://github.com/cliqz-oss/browser-core/blob/7679c40aec9fe5dd5df590aab4a12b9627563068/modules/core/sources/request-sanitizer.es](https://github.com/cliqz-oss/browser-core/blob/7679c40aec9fe5dd5df590aab4a12b9627563068/modules/core/sources/request-sanitizer.es)

[2]: [https://www.0x65.dev/blog/2019-12-03/human-web-collecting-data-in-a-socially-responsible-manner.html](https://www.0x65.dev/blog/2019-12-03/human-web-collecting-data-in-a-socially-responsible-manner.html)

[3]: [https://www.0x65.dev/blog/2019-12-04/human-web-proxy-network-hpn.html](https://www.0x65.dev/blog/2019-12-04/human-web-proxy-network-hpn.html)

Disclaimer: I work for Cliqz.

------
DyslexicAtheist
> _Enlarge the time frame for recommendations (our current product is based on
> the standard 24-hour news cycle, which makes sense if you consider the
> ephemeral nature of news data, but we want to evolve our news—to a content
> recommendation product)._

it's impossible to keep the promise of delivering quality content, when their
sourcing for the "What you need to know / Top News" feature submits itself to
the rules of a _24 hour news cycle_ [1]:

[1]
[https://en.wikipedia.org/wiki/24-hour_news_cycle#Critical_assessment](https://en.wikipedia.org/wiki/24-hour_news_cycle#Critical_assessment)

... and ...

> _footnote: The first version of Cliqz News recommended articles to users in
> the form of a web application by harnessing the power of the crowd. Topics
> were represented by large groups of Twitter users with influence in a
> particular topic. The content circulated within these groups was then
> classified and ranked, allowing us to make predictions about stories you might
> want to read based on detected similarities with other users’ interests_

They realized they could predict what will become newsworthy by looking at
Twitter echo chambers. Twitter is awesome because we get to watch how
(especially Western) news breaks, and how larger established outlets cover the
same story, and (with some manual digging) follow discussions/reactions in
different echo chambers and how these continue to affect the coverage. Twitter
feels (or in the past felt) like grassroots journalism. The problems of the
24hr-News-Cycle aren't created by Twitter and already existed with tabloids.
But Twitter amplifies the problems mentioned on Wikipedia tremendously, while
even creating some new ones. Cliqz doesn't look like it is going to solve any
of this, instead they assume nothing is wrong with how things work currently.
Cliqz's inability to think about this problem, while at the same time building /
peddling services on top of it, is a red flag imo.

EDIT: the Wikipedia page has a brilliant book[0][1] linked at the bottom
("Skyful of Lies & Black Swans") which I highly recommend:

[0] summary:
[https://reutersinstitute.politics.ox.ac.uk/our-research/skyful-lies-black-swans](https://reutersinstitute.politics.ox.ac.uk/our-research/skyful-lies-black-swans)

[1] pdf:
[https://reutersinstitute.politics.ox.ac.uk/sites/default/files/2017-12/Skyful%20of%20Lies%20%26%20Black%20Swans.%20The%20new%20tyranny%20of%20shifting%20information%20power%20in%20crises.pdf](https://reutersinstitute.politics.ox.ac.uk/sites/default/files/2017-12/Skyful%20of%20Lies%20%26%20Black%20Swans.%20The%20new%20tyranny%20of%20shifting%20information%20power%20in%20crises.pdf)

~~~
PaxX
Hi, [disclaimer - I work at Cliqz]

Cliqz News covers the “news cycle” consisting of “the media reporting on some
event, followed by the media reporting on public and other reactions to the
earlier reports” mentioned in the Wikipedia page you base your comment on, by
the nature of the product, since what you "need to know" is the most relevant
news events at any time of day. We look 24 hours back to get all the news
articles from hand-picked, trusted news domains in a country and follow event
developments. The process is started every hour. There are no social signals
involved in the development of this product.

Below is the product statement again for further clarification: “The product
translates into a limited, per-country list of news articles, aiming to keep
Cliqz users informed about current events. All articles originate from hand-
picked, well-respected and trusted news outlets in a particular country. In
order to determine the most relevant news, we first collect and process all
articles that were published by those outlets during a defined timeframe,
which we then compare among each other to form clusters around current events.
The impact of these events, their prevalence on news sources, their presence
on homepages and their times of publication all help to curate a list of Top
News—updated hourly or every time a major story breaks.”

------
kgdinesh
Now only if there was a way to take this across channels.

------
meritt
Cliqz is currently gaming the shit outta HN with submissions every day this
month, many staying on the front page for hours:
[https://news.ycombinator.com/from?site=0x65.dev](https://news.ycombinator.com/from?site=0x65.dev)

------
Rerarom
I like how the domain immediately suggests that the site is not bloated.

~~~
neonate
Sorry for missing the point, but how does the domain suggest that?

