
A New Search Engine - chrmod
https://0x65.dev/blog/2019-12-05/a-new-search-engine.html
======
mgreg
They talk about using query logs to optimize their search results:

>Queries performed by people, if associated to a web page, serve as even
cleaner summaries than anchor text. This is because all the logic put in place
by the search engine, who resolved the query with a list of web pages, and all
human understanding and experience that led one to select the best page from
the offered result list end up embedded in the association <query, url>.

This would seem to present a "rich get richer" problem where the oldest links
that have the largest click-through tend to float to the top making it
difficult for a new result that may be "better" to appear high in the search
results. Anyone know how search engines tackle this problem?
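
To make the <query, url> association in the quote concrete, here is a minimal sketch. It assumes a hypothetical click log of (query, clicked URL) pairs; `click_log`, `build_associations` and `rank` are illustrative names, not anything the post describes. Ranking purely by raw click counts, as done here, is exactly what produces the "rich get richer" effect.

```python
from collections import Counter, defaultdict

def build_associations(click_log):
    """Count how often each URL was clicked for each (normalized) query."""
    assoc = defaultdict(Counter)
    for query, url in click_log:
        assoc[query.lower().strip()][url] += 1
    return assoc

def rank(assoc, query):
    """Return URLs for a query, most-clicked first (ties broken by URL)."""
    counts = assoc.get(query.lower().strip(), Counter())
    return [url for url, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

# Hypothetical log: the older page has simply accumulated more clicks.
click_log = [
    ("python tutorial", "https://old.example/tutorial"),
    ("python tutorial", "https://old.example/tutorial"),
    ("Python Tutorial", "https://new.example/tutorial"),
]
assoc = build_associations(click_log)
ranking = rank(assoc, "python tutorial")
```

With these made-up clicks the older page outranks the newer one only because it has been around longer to collect them.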

~~~
mc987
Surfacing new content in search engines is a very challenging problem. I am
guessing they use a combination of social signals (twitter, facebook)
popularity and domain popularity amongst other signals.

~~~
pheug
Google pretty much knows (or can accurately estimate) exactly when a new
document appears on the web and how many people are visiting it, they don't
even need to rely on second hand social signals for this. They control the
web's dominant crawler (Googlebot), browser (Chrome, which sends everything
you type in the address bar to them by default), ads (Adsense) and tracking
(Google Analytics) platforms.

------
lawrenceyan
> Money and Resources : We have been lucky enough to have fantastic investors,
> who fund and help us in our journey.

Just as an FYI, this company Cliqz is owned by Hubert Burda Media, a large
media conglomerate based in Germany.

That doesn't necessarily inherently mean anything negative, but it's important
to understand the potential underlying incentives given their marketing as
such a strongly privacy oriented service.

~~~
solso
An excerpt from the 1st post of this series: "Why would a team be motivated to
build another search engine? Why would Hubert Burda Media finance this over
several years (they continued to back us especially in times when things got
tough)?" [https://0x65.dev/blog/2019-12-01/the-world-needs-cliqz-
the-w...](https://0x65.dev/blog/2019-12-01/the-world-needs-cliqz-the-world-
needs-more-search-engines.html)

~~~
hobofan
Yes, they mention it, which is a good move regarding transparency, but they
still don't answer the question as to why they finance them.

They mostly push a narrative of privacy and censorship, when in the end the
answer is probably close to "we want a piece of the pie" or "we want to be
that monopoly".

~~~
solso
[Disclaimer, I work at Cliqz] I cannot answer for the "true" motivation of the
investors, but their pitch and actions so far are well aligned with the fight
against monopolies narrative. Do they want to get return on investment
(eventually)? I would assume so, and I believe it would be fair. I do not see
them as mutually exclusive. Of course, this is my personal opinion.

~~~
AJ007
It’s a good investment as a hedge in the case the EU regulators kick Google in
the ass hard enough.

~~~
lolive
The funny thing is that QWant is currently challenged on its ability to
monetize its search engine. Short answer (for the moment): it can't.
[https://www.lemonde.fr/economie/article/2019/12/04/le-
google...](https://www.lemonde.fr/economie/article/2019/12/04/le-google-a-la-
francaise-en-quete-de-fonds-et-d-un-patron_6021607_3234.html) (in french)

------
stanislavb
I don't know how many of you tried the engine, but there are 2 features that
instantly caught my attention:

1) Trackers Stats. Essentially, you can see how many and what trackers there
are on the page you are about to visit. Before visiting it.

2) Page previews (I'm not sure about whether I like that)

~~~
__ka
> 1) Trackers Stats.

This feature is powered by another project we run, where we measure the
tracking landscape in the web (most popular domains):
[https://whotracks.me](https://whotracks.me). Details on how that works can be
found in our paper [0]. Also - we are flirting with the idea of providing a
mode where the ranking is informed by the trackers in the destination site.
Would love to hear your thoughts on whether you'd like smth like this.

> 2) Page previews (I'm not sure about whether I like that)

At the moment it's only a placeholder for a lengthier title and description
(if available), but we are planning to use the space for rendering a short
summary of the content/media in that site + similar sites in terms of content
(query-relevant of course). This is more work in progress as we want to make
sure content creators are on board. Again: would love to hear your thoughts on
that.

Disclaimer: I work at Cliqz.

[0] WhoTracks.Me: Shedding light on the opaque world of online tracking -
[https://arxiv.org/abs/1804.08959](https://arxiv.org/abs/1804.08959)

~~~
martinlaz
> we are flirting with the idea of providing a mode where the ranking is
> informed by the trackers in the destination site. Would love to hear your
> thoughts on whether you'd like smth like this.

That's definitely the right way to go. I would also very much appreciate an
option not to show in the result list sites/pages having any trackers.
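
A rough sketch of what such a mode could look like, assuming each result carries a relevance score and a tracker count. The tuple layout, the `penalty` weight and the `max_trackers` cutoff are all hypothetical, not Cliqz's actual ranking:

```python
def rerank_by_trackers(results, penalty=0.05, max_trackers=None):
    """Re-rank (url, relevance, tracker_count) tuples.

    Each tracker subtracts `penalty` from the relevance score. If
    `max_trackers` is set, results above that count are dropped entirely.
    """
    kept = [r for r in results if max_trackers is None or r[2] <= max_trackers]
    return sorted(kept, key=lambda r: r[1] - penalty * r[2], reverse=True)

# Made-up results: (url, relevance score, number of trackers).
results = [
    ("https://a.example", 0.90, 12),
    ("https://b.example", 0.80, 1),
    ("https://c.example", 0.70, 0),
]
reranked = [url for url, _, _ in rerank_by_trackers(results)]
strict = [url for url, _, _ in rerank_by_trackers(results, max_trackers=0)]
```

Setting `max_trackers=0` gives the "hide anything with trackers" option; leaving it unset only demotes tracker-heavy pages.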

~~~
trisch
I'm afraid this will remove all results from the page :-D

~~~
martinlaz
Is it really that bad? I surely hope it's not.

I noticed that even Wikipedia is reported as having some trackers. But when I
looked closer I noticed that most of those belong to the Wikimedia foundation,
which is fine. I mean, I don't mind site owners tracking what I do on _their_
site, I just don't want to be followed across the whole Web.

The rest of the Wikipedia trackers are supposed to be Google fonts and
statics, but I couldn't witness any calls to those. Maybe the stats are not
quite up to date?

If such a score is to be given, it better be fair and reflecting the current
state of affairs.

~~~
sammacbeth_
Hi, I work at Cliqz on our Anti-tracking system, and the WhoTracks.Me data
that powers these stats on the search page.

These stats are updated monthly, and based on millions of loads of each site.
The WhoTracks.Me page for wikipedia.org
([https://whotracks.me/websites/wikipedia.org.html](https://whotracks.me/websites/wikipedia.org.html))
shows that the Google Fonts and Google Static trackers occur very infrequently
(<2% of pages), so may be on some part of the site that you did not visit.

While the Wikimedia tracker may seem innocuous, they do set a cookie that is
sent in third-party contexts, and have presence across several sites beyond
Wikipedia (133 of the top 10k)
([https://whotracks.me/trackers/wikimedia.org.html](https://whotracks.me/trackers/wikimedia.org.html)).
Theoretically, they could track user sessions across these sites. In reality
this is likely an oversight in the server configuration, but objectively this
profile looks no different to that of a legitimate tracker.

~~~
martinlaz
Thanks for the explanation. It makes sense now. And that's ... depressing.

------
rabidrat
What I really want, is not another search engine for contextless queries.
Except for really basic queries (which Google/etc already do a good job at),
I'm trying to answer a question, perhaps open-ended, and it will take multiple
queries to resolve. And it's not a linear process of narrowing down with + or
- keywords. It's establishing a context: I'm searching for something relevant
to "go" the language, not "go" the english verb or "go" the board game.

I want to be able to open a jupyter-like notebook with the start of my search
query, and the first step should be to show me the available eigencontexts,
from which I can establish the gross context for my entire search. After this
first click, none of the results should be about the board game or the english
word--unless the relevant search results happen to include an implementation
of Go the board game in Go the language.

And then when I'm done, I want to name and archive that notebook so I can
return to it at a later date--whether to refresh my memory of the ultimate
answer, or to continue the search.

I guess I would call this a 'research engine' instead of a search engine.

~~~
solso
[Disclaimer: I work at Cliqz]

I hate to answer this one, because it looks too much like marketing-speech but this
feature exists. Not on beta.cliqz.com but on the drop-down search on Cliqz
browser.

Based on the tabs you have opened, different query expansions are selected.
For instance as you type "hotel in ma..." probably would show you results for
Mallorca, but if you have "Madrid" on a tab, then it will show results for
"hotel in madrid".

There will be a blog post about this contextual search because it's our
showcase that it is possible to do personalization without compromising privacy.
All this is done privately, the browser receives results for multiple
expansions and can choose which one to display based on local information. We
never track or collect sessions of users.
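
As a minimal sketch of that idea (not the actual Cliqz implementation): assume the server returns results for several candidate expansions, and the browser picks one by word overlap with locally open tab titles. The function names and the overlap heuristic are assumptions made for illustration.

```python
import re

def tokens(text):
    """Lowercase word tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def choose_expansion(expansions, open_tab_titles):
    """Pick the expansion sharing the most words with locally open tabs.

    `expansions` maps each candidate expansion to its result list, in the
    server's default order, and must be non-empty. The tab titles are only
    used inside this function and are never sent anywhere.
    """
    local_words = set()
    for title in open_tab_titles:
        local_words |= tokens(title)
    best, best_score = None, -1
    for expansion in expansions:  # dicts preserve insertion order
        score = len(tokens(expansion) & local_words)
        if score > best_score:  # strict '>' keeps the server default on ties
            best, best_score = expansion, score
    return best, expansions[best]

# Hypothetical server response for the prefix "hotel in ma...".
expansions = {
    "hotel in mallorca": ["https://example.com/mallorca-hotels"],
    "hotel in madrid": ["https://example.com/madrid-hotels"],
}
chosen, results = choose_expansion(expansions, ["Madrid travel guide"])
default, _ = choose_expansion(expansions, [])
```

With a "Madrid" tab open the Madrid expansion is chosen; with no tabs the server's first expansion is kept.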

~~~
amelius
Ok, but can't you then figure out what the browser was displaying using
Javascript?

~~~
whatshisface
They could, but then everyone would see them do it, and kind of the whole
point is that they won't do it.

~~~
amelius
But who in their right mind would allow the Javascript code of one tab to
access data in other tabs?

------
nostromo
The nationalism on the homepage is a little odd... in particular since they're
still essentially building on top of Google.

[https://cliqz.com/en/](https://cliqz.com/en/)

> Europe has failed to build its own digital infrastructure. US companies such
> as Google have thus been able to secure supremacy. They ruthlessly exploit
> our data. They skim off all profits. They impose their rules on us. In order
> not to become completely dependent and end up as a digital colony, we
> Europeans must now build our own independent infrastructure as the
> foundation for a sovereign future. A future in which our values apply. A
> future in which Europe receives a fair share of the added value. A future in
> which we all have sovereign control over our data and our digital lives. And
> this is exactly why we at Cliqz are developing digital key technologies made
> in Germany.

~~~
autoexec
There's been a lot of discussion about how US-centric the internet is in
general even discounting how many massively popular internet companies are US
based. I don't think it's unreasonable for Europeans or other nations to try
to be less dependent on the US and US based services. As an American, I think
it's the smartest thing they can do and I welcome it.

~~~
Semiapies
As far as that goes, I agree.

I have an uncomfortable feeling, however that this is when the walls really
start going up in the internet, beyond just the dictatorships.

------
jsnell
Ah, yes. The old "hiybbprqag" method. Worked great for Bing.

~~~
cpeterso
TIL:

_"Hiybbprqag?" How Google Tripped Up Microsoft_

[https://www.cbsnews.com/news/hiybbprqag-how-google-
tripped-u...](https://www.cbsnews.com/news/hiybbprqag-how-google-tripped-up-
microsoft/)

------
robgibbons
I like Cliqz and am already impressed with the results for several of my test
queries, though they certainly have a ways to go.

But what I would really like to see, as has been mentioned in other threads,
is an open-source or community-funded search engine. Something that "belongs
to the web" itself, so to speak, and not to any particular corporation.

~~~
BlackLotus89
Like yacy? [https://yacy.net/](https://yacy.net/)

Tried many times to use it but since it was a resource hungry java application
that required me to use the web through a http proxy to contribute it wasn't
really useable for me at any time. Also the search results were mostly garbage
for me.

------
z3t4
If you are going to make a new search engine, you need to attack a problem
that people have, like Duckduckgo solving privacy issues. I don't want to
install something that collects a bunch of personal info about me. A better
idea is to search bookmarked sites and the cache. And do it locally.

~~~
ricardo81
To be fair, Mojeek addressed the search engine privacy problem long before DDG
existed.

------
fractalf
Wow, just wow. I've been working with Odoo for a couple of years now and it's
been a frustrating experience because its documentation sucks badly and it's
really freakin hard to get relevant answers from DuckDuckGo or Google when
stuck. I tried out a search now on Cliqz and can't believe how good and
relevant the result was. Could be a lucky shot, but I'm definitely gonna try
this out more. Great work guys! :)

------
topicseed
Does anybody know Cliqz's database stack? Curious to see what powers a large scale
information retrieval index of this sort.

~~~
ssubu
[Disclaimer: I work at Cliqz]

We will have a blog post tomorrow on this very topic, but in short, we use a
combination of Keyvi, Granne (both in-house) along with Cassandra and RocksDB.

Though our approach mentioned in this blogpost significantly reduces the
storage needed to host the index, we still have an index of around 50 TB of
data.

------
superkuh
There was another one of these posts a week or so ago. In that one I was one
of many that complained the search engine was unusable without javascript
enabled.

Now you can search without javascript enabled. Thanks, cliqz devs.

~~~
__ka
It's still not perfect, but should be usable. Many thanks for the feedback :))

------
marcell
Do you (founders/employees) have any example queries that don’t work well with
Google, and correspondingly what pages you think should be top ranked for
those queries?

~~~
stevenicr
'sex chat' currently, from my location, 'free chat now' has 2 of the top 3
results. 'i sexy chat' has 2 of the top ten results.

'chat i w ' is number 13, and has been top 10 for much of past couple / few
years.. yet they are 'not an adult site' since they run GGL ads...

should be top 5 again.. sexchatsexchat.com has way more content and history..

there are many more sites I could suggest that actually have chat systems
running (unlike the porn dood site which is a top 20 link list)

there are many good sites that aren't even in the results at all... these are
being gamed by well connected linkers, not ranked by amount of content and
length of time people would stay and enjoy.

imbo - in my biased opinion, I have more to add but wonder if it does any
good.

a new engine that handles adult better, I would help with.. the other sites
listed here do not do these results justice either.. again imbo, ymmv.

------
cirno
Since the Cliqz devs are here, and this engine is based in Germany, a
question: does your search engine have any mechanisms for reporting abusive
URLs (doxxing, targeted harassment, revenge porn, etc) beyond right-to-be-
forgotten, or are you more a laissez-faire, everything-goes kind of search
company?

I noticed that your engine ranks some of the nastier sites on the internet far
higher than any other search engine I've looked into.

~~~
trisch
[Disclaimer: I work at Cliqz] Yes, there is a way to report such urls
[https://cliqz.com/en/report-url](https://cliqz.com/en/report-url)

We do have a list of blacklisted urls/domains, mostly regarding adult topics
(child porn etc). If you have noticed some bad sites in our results, please
feel free to drop a line to our support team using the link I provided.

~~~
cirno
Thanks for the reply. A bit disappointed it only counts for extremely illegal
content. There's a lot of really negative stuff out there that is blatantly
false and manipulative (Ripoff Report, Tumblr callout posts, etc) and it's
always a shame that this kind of negative toxicity gets promoted so high in
SERPs.

I'd really like it if there were an ethical SERP that at least had some
integrity with its results. Reporting factual unflattering statements is one
thing (and ideal), but promoting libel feels really dirty, and so far Cliqz
seems to be the worst at that of any search engines I've used, and your
reporting link seems as though Cliqz is okay with that.

------
cocktailpeanuts
I would like to read this but I can't reach the web server. Is it just me?

Had the same issue with another article from this same site a couple of days
ago. Looks like everyone else is able to read it but for some reason not me.

Anyone know what's going on?

~~~
kkm
Hi,

Interesting, could you tell us what error you see?

Other ways you can reach the blog: If you use Tor browser can you try opening:
[http://cliqzdevxo33b4h6.onion/](http://cliqzdevxo33b4h6.onion/)

Or if you use Beaker browser:
dat://ee172d7cd9235b2cf86ea9481e8a40e48cea29c743036621edc79a4765aa0281

Disclaimer: I work for Cliqz.

~~~
cocktailpeanuts
I get the following error:

This site can’t be reached

0x65.dev refused to connect.

Try:

Checking the connection

Checking the proxy and the firewall

ERR_CONNECTION_REFUSED

This happens on Chrome, Firefox, Safari, and Opera on my Mac.

~~~
kkm
First, let's check if you can open another domain on .dev TLD, like web.dev,
if not then:

Seems like you have some mapping for .dev TLD. Assuming based on your mention
of Safari, that you are using Mac.

Could you check if you have some setting in your /etc/resolver for dev TLD, or
if you are using some service like dnsmasq which is trying to resolve .dev to
a non-existent location.
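
A quick way to check for such per-domain overrides on a Mac is to list the files under /etc/resolver and the nameservers each one points to. This is only a sketch; the helper name `resolver_overrides` is made up, and it only looks at `nameserver` lines:

```python
from pathlib import Path

def resolver_overrides(resolver_dir="/etc/resolver"):
    """Map each per-domain resolver file to the nameservers it lists."""
    overrides = {}
    directory = Path(resolver_dir)
    if not directory.is_dir():
        return overrides  # no per-domain overrides configured
    for entry in sorted(directory.iterdir()):
        if not entry.is_file():
            continue
        servers = []
        for line in entry.read_text().splitlines():
            parts = line.split()
            # Only "nameserver <address>" lines are of interest here.
            if len(parts) >= 2 and parts[0] == "nameserver":
                servers.append(parts[1])
        overrides[entry.name] = servers
    return overrides

if __name__ == "__main__":
    for domain, servers in resolver_overrides().items():
        print(f"{domain}: {servers}")
```

An entry named `dev` pointing at 127.0.0.1 would mean every .dev lookup is sent to a local DNS server instead of the normal resolver.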

~~~
cocktailpeanuts
Oh that's weird, I have "nameserver 127.0.0.1" under /etc/resolver/dev

I am on mac but I didn't touch anything. Is this how mac ships by default? Or
do you think some app may have created this file?

------
dsunku
> The experts, who chose to answer, suggested that we should first start with
> crawling the whole web. We were told that this would take between 1 and 2
> years to complete, and would cost a minimum of $1 billion

Why are costs so high for crawling?

~~~
theblackcat2004
Wouldn’t common crawl content be enough? If not what are the issues?

~~~
trisch
No, it's not enough and has poor coverage outside of the USA. We have also
answered this question (it appears to be popular) in today's post about
technical details of our search [https://0x65.dev/blog/2019-12-06/building-a-
search-engine-fr...](https://0x65.dev/blog/2019-12-06/building-a-search-
engine-from-scratch.html)

------
disordinary
I remember when Cuil launched about 10 years ago they suggested that 1% of the
search market was worth a billion dollars so it's big money if you can get
inroads. Of course search is probably less important now than it was back
then, with discovery happening on social media more and more, but the internet
as a whole is much larger than 10 years ago so I wouldn't be surprised if 1%
is worth more nowadays.

------
ragerino
We need a smarter search.

One that is based on analyzing the content of a page rather than on its page rank.

It goes without saying that it has to be open source.

Apache SOLR would be a good starting point.

------
BlackLotus89
Cliqz nearly made me stop using firefox a while back
[https://www.heise.de/-3852129](https://www.heise.de/-3852129) (german
article)

Tldr 1% of german firefox installations automatically uploaded search queries
to Cliqz. I won't trust a search engine like this with any of my data.

~~~
autoexec
Would you trust google who does the same thing with chrome? Bing which does
the same thing with IE (or whatever it's called now)? Blame firefox for
selling out their users not the search engine. I didn't close my account with
amazon when Ubuntu started sending searches to them, I just switched distros.

~~~
nostromo
There's a difference.

When I use a service from Google, I expect that my data will be parsed by
Google. And I can decide if I trust Google or not.

But Firefox sending the urls I visit to a third party (Cliqz) silently and
without permission is shady and deceptive.

And then, after all this, Cliqz claims that it's a company built on privacy...
sheesh.

~~~
galaxyLogic
An interesting quote from their article: " Philosophically, we believe copying
is a loaded term, we prefer to use the term learning. Learning from each other
is something all of us do"

What is the difference between copying and learning?

------
amelius
What I miss in this post is a list of references to the _huge_ literature that
exists on this topic, and related fields such as NLP.

Also I don't see a clear problem description. What _is_ a search engine,
really? How would you compare the quality of two search engines, objectively?

------
charlesism

> Why the second constraint? one might ask. Besides the obvious potential for
> profitability, our mission was

The search engine the world needs is one with independence and _non-
profitability_. If the creators are preoccupied with turning a profit, they’ll
introduce the same garbage features as Google. It’s a shame, because a good
search engine could shorten the time humanity has to wait for advances (eg:
cures for cancers, cheaper energy, etc)

~~~
erulabs
Indexing the web is a resource intensive activity tho - if it was federated
then the resource cost only increases. I suppose a non-profit is the
alternative, but non-profits are not exactly independent unless they have some
sort of massive endowment. I'm not trying to disagree with you, it's just a
paradoxical problem: to resolve the issue, resources must be accumulated.
Accumulating resources means it's hard to resolve the issue (of an independent
search engine).

------
jayess
Weird, your domain (cliqz.com) was blocked by my pihole.

~~~
philippclassen
(Disclaimer: I work at Cliqz) We had problems with being blocked in the past.

In cases where we got a chance to explain, they agreed that it was a false
positive and took us off the block list. At least, that happened so far in all
cases that I'm aware of. However, there are so many lists that it is hard to
keep track of them. Would be nice if you could provide some information which
block list it is, so we can contact them.

The reason why we end up on the blocklist is normally a misconception of our data
collection system Human Web: [https://0x65.dev/blog/2019-12-03/human-web-
collecting-data-i...](https://0x65.dev/blog/2019-12-03/human-web-collecting-
data-in-a-socially-responsible-manner.html)

If someone does not want to send Human Web data, the feature can also be
disabled through the UI. Same if you browse in a private window; Human Web is
automatically disabled there. There is no need to configure blocking rules.

------
yurisokolov
There is a dark side to this story. With Burda
[https://en.wikipedia.org/wiki/Hubert_Burda_Media](https://en.wikipedia.org/wiki/Hubert_Burda_Media),
the same people who are behind the Cliqz search engine were originally also
behind the German Leistungsschutzrecht.
[https://en.wikipedia.org/wiki/Ancillary_copyright_for_press_...](https://en.wikipedia.org/wiki/Ancillary_copyright_for_press_publishers)
This law, heavily lobbied for by publishers, forces every search engine and
everybody else using content from the internet to pay a private tax of 6% of
the revenue (not from profit!). [https://www.vg-media.de/de/digitale-
verlegerische-angebote/f...](https://www.vg-media.de/de/digitale-
verlegerische-angebote/fragen-und-antworten.html) As the profit of most
internet companies is below this margin, it is essentially forcing many
companies out of business.

This tax is enforced and collected by VG Media, the German collecting society
representing rights of a group of German publishers. [https://www.vg-
media.de](https://www.vg-media.de) Between 2013 and 2016 Burda was a
shareholder of VG Media, which was commissioned to enforce the tax in its
name.

The evil thing about this law is that the publishers are not required to mark
their content in machine-readable form as paid content. And a manual selection
is infeasible at internet scale with billions of pages. So a search engine
has no means to bypass the paid content and index only free content (such as
Wikipedia), which makes up the majority of the internet content. Essentially
the "Leistungsschutzrecht" takes the free content hostage to extort money for
using the internet, even if you don't use paid content of the publishers (the
just 200 publications the VG Media represents).

So while Burda's Cliqz writes on their blog "The world needs more search
engines" [https://www.0x65.dev/blog/2019-12-01/the-world-needs-
cliqz-t...](https://www.0x65.dev/blog/2019-12-01/the-world-needs-cliqz-the-
world-needs-more-search-engines.html) they supported a law that made it
impossible for many search engines to operate in Germany (and in the EU via
the similar EU law "Extra copyright for news sites" (“Link tax”)
[https://juliareda.eu/eu-copyright-reform/extra-copyright-
for...](https://juliareda.eu/eu-copyright-reform/extra-copyright-for-news-
sites/)). And while today they are no longer a shareholder of VG Media, they
still benefit from the suppressive legal environment they helped to create, as
it prevents any new independent competition from entering the search market.

~~~
solso
[Disclaimer: I work at Cliqz]

Sorry for taking so long to reply, I was personally trying to dig some
information about this. An additional disclaimer: not a lawyer either.

Honestly, I have little idea of how this law affects search engines. What I
can say is that we are not paying anything, and AFAIK we do not know anyone who
is. Moreover, if some publisher would complain, even one in Burda, we would
stop crawling by domain, there is no technical issue here, properties are
known by the imprint. We have no say on what the investors do but I can assure
you that we have no pressure. For instance, our ad-blocker works everywhere,
regardless if the sites are from Burda or not.

On a general level, assuming that what you say is factually correct, I must
personally agree that regulation is a bitch. It's typically designed for big
companies to control other big companies, but small ones get negatively
affected if only because of the lack of resources. We recently had to suffer
all the overhead of GDPR, which consumed a fair amount of our time; relatively,
we paid a higher price than Google.

Personally, I cannot respond for all the decisions made by the people funding
Cliqz, I do not even think I can judge it either. They might be complaining
and lobbying, no idea. But they are also putting good money to build a
privacy-preserving search engine and a browser, something that no-one else is
doing, so on my account they are on the positive side.

------
Braggadocious
How many algorithms are there in chrome alone? I remember when people realized
that they could game Facebook shares for higher rankings on chrome and for a
while buzzfeed top ten lists outranked Wikipedia every fucking time. I guess
that’s still going on. What a clusterfuck search results are nowadays.

If anyone builds anything, please make it so algorithms or queries are
archived. I hate how I can’t find anything on the internet that I searched for
and found years ago. It's like the history of the internet evaporates every
year. I don’t even know if some websites still exist or if I simply can’t find
them because rankings are terrible.

I’m to the point that I haven’t been on a new website in years. How do you
find new websites in this day and age when the same websites are ranked at the
top every time?

