
The Architecture of a Large-Scale Web Search Engine, Circa 2019 - nikk699
https://www.0x65.dev/blog/2019-12-14/the-architecture-of-a-large-scale-web-search-engine-circa-2019.html
======
wpietri
I hadn't heard of it, but apparently Cliqz is a search engine [1] and browser
[2] built by a German media company [3].

[1] [https://0x65.dev/blog/2019-12-01/the-world-needs-cliqz-
the-w...](https://0x65.dev/blog/2019-12-01/the-world-needs-cliqz-the-world-
needs-more-search-engines.html)

[2] [https://en.wikipedia.org/wiki/Cliqz](https://en.wikipedia.org/wiki/Cliqz)

[3]
[https://en.wikipedia.org/wiki/Hubert_Burda_Media](https://en.wikipedia.org/wiki/Hubert_Burda_Media)

------
prox
How does a new engine find webpages at the start? Does it work from the DNS
system and index every domain name? At a certain point it will follow links, I
presume, but how does it start?

~~~
ssubu
[Disclaimer: I work at Cliqz] We do not crawl the web in the traditional sense;
our search was bootstrapped on query logs. That is the very reason we could
succeed in building a search engine with minimal resources compared to
our competitors. We have written about this in a lot more detail here:

How we collect data: [https://www.0x65.dev/blog/2019-12-03/human-web-
collecting-da...](https://www.0x65.dev/blog/2019-12-03/human-web-collecting-
data-in-a-socially-responsible-manner.html)

How we build the search using this data:
[https://www.0x65.dev/blog/2019-12-06/building-a-search-
engin...](https://www.0x65.dev/blog/2019-12-06/building-a-search-engine-from-
scratch.html)

Feel free to peruse these posts and ask questions!

~~~
ThePhysicist
Does all of your data really come from the Human Web project, or do you also
buy clickstream data from data brokers?

~~~
ssubu
We speak about this in much more detail in this post
([https://0x65.dev/blog/2019-12-05/a-new-search-
engine.html](https://0x65.dev/blog/2019-12-05/a-new-search-engine.html)), but
in short, we initially prototyped our search with data we purchased from data
brokers. Since the concept was proven and HumanWeb was deployed (2015/2016), we
have relied only on our own data.

------
leeoniya
last time Cliqz came up on here was not in the best of contexts...

[https://old.reddit.com/r/firefox/comments/74yo19/cliqz_and_m...](https://old.reddit.com/r/firefox/comments/74yo19/cliqz_and_mozilla_as_i_understand_it_and_metadrama/)

~~~
pythux
The "last time Cliqz came up" can be found here:
[https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...](https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=0x65.dev&sort=byDate&type=story).
We have been posting multiple articles on our tech blog, explaining what we do
and how we do it in great detail. Your link points to an old thread from more
than two years ago.

There have been more recent discussions about Cliqz, some as late as this
month, in particular here:
[https://news.ycombinator.com/item?id=21676252](https://news.ycombinator.com/item?id=21676252)

[disclaimer: I work at Cliqz]

~~~
leeoniya
yeah sorry, should have said "last i remember".

> We have been posting multiple articles on our tech blog, explaining what we
> do and how we do it in great detail.

it's possible to have both great tech and loose morals - the two are not
mutually exclusive, and one does not absolve the other (e.g. facebook's social
experiments)

has there been a followup to any of the points brought up in the reddit
thread?

~~~
solso
[Disclaimer: I do work at Cliqz]

There is plenty of documentation on the data collected (see the first posts
regarding Human Web on the tech blog), how anonymization works, why
record-linkability on the data collected is prevented (and forbidden), etc.
Furthermore, the source code can be inspected, as well as the traffic, in case
the documentation is not enough. I believe that is a better proxy to assess
"morality" than random accusations on reddit or opinions formed solely on
half-baked press releases.

Do we need to refute all the misconceptions and FUD that might arise due to the
fact that 1) we collect data to build our services (search) and 2) we are
funded by a media company (VCs seem to be more pure for an unknown reason)?

The answer is no. I cannot recall who said that it takes much more effort to
refute BS than to generate it. (That does not apply to your comment in
particular, which is why we replied, but to many of the comments and some of
the content of the subreddit that you mention.)

~~~
neiman
> There is plenty of documentation on data collected (see first posts
> regarding Human Web on the tech blog), how anonymization works, why record-
> linkability on data collected is prevented (and forbidden), etc.
> Furthermore, the source code can be inspected, as well as the traffic, in
> case the documentation is not enough.

Question is, is it opt-in data collection or do you make the choice for me? If
it's opt-in, great. Otherwise, I don't want to read your "plenty of
documentation" and so on and so forth.

~~~
rewq4321
You can apparently opt-out. On mobile but I think it was in one of their blog
posts.

~~~
neiman
I'd really rather opt in :-)

------
ksec
If I remember correctly, Yahoo open-sourced its current and next-generation
search engine, Vespa [1]. Why wasn't that used instead of starting from
scratch?

[1] [https://vespa.ai](https://vespa.ai)

~~~
netankit
Vespa is a very interesting project, but it came quite late for us (Sept.
2017 [1]).

Work on Cliqz Search started much earlier, around 2013. Our work on Kubernetes
and modernizing our architecture also started around 2016.

[1] [https://www.verizonmedia.com/press/open-sourcing-vespa-
yahoo...](https://www.verizonmedia.com/press/open-sourcing-vespa-yahoo-s-big-
data-processing-and-serving-eng/)

~~~
lowdose
Did you guys choose Go over Java?

------
markpapadakis
I like how the blog post title is, likely, based on Brin and Page's seminal
Google paper, "The Anatomy of a Large-Scale Hypertextual Web Search Engine".
:)

This blog post, and others in the series, mention that RocksDB is used for the
index, but it's not explicitly described how and to what end. I'd love to
know the details.

------
fessguid
Thank you for the great article. I've been writing my own search engine for the
last 3 years, and it's funny how similar my setup is to yours with
K8S/Kafka/Streams/Go/RocksDB. Actually, about RocksDB: are you using it from
Go (via gorocksdb?)? Now and then I have a hard time optimising RocksDB, and I
still have only a very loose understanding of how much RAM it will consume.

------
streetcat1
Just be careful. Kubeflow is heading toward GCP only features (for example,
they are dropping cert-manager), while you are betting on AWS.

~~~
yowlingcat
Would it be fair to say that Kubeflow may eventually become incompatible with
EKS?

~~~
streetcat1
Yes. Or, more likely, there will be a version of Kubeflow for EKS. Alas, Amazon
is pushing SageMaker.

Note that Kubeflow R&D is all done by Google Cloud.

------
bkyan
I wonder if they have any plans for an API...?

