
Building a Search Engine from Scratch - jpschm
https://0x65.dev/blog/2019-12-06/building-a-search-engine-from-scratch.html
======
That Cliqz is trying to actually build a new search stack is commendable. This
is way more exciting to me than DuckDuckGo and other services that just
package up Bing search results under different branding.

I'm skeptical that they'll be successful, but I wish them the best. They
should market (and engineer) strongly on privacy since that's where Google is
weak.

~~~
mrb
« _DuckDuckGo and other services that just package up Bing search results_ »

Had to look that up. I found
[https://help.duckduckgo.com/results/sources/](https://help.duckduckgo.com/results/sources/)
Bing is just one of "hundreds of vertical sources delivering" results to
DuckDuckGo.

~~~
mda
Bing is the primary source though.

~~~
gregschlom
I'm not sure that's true. If you search the same query on both Bing and DDG,
you'll see pretty different results.

~~~
manojlds
I played with Yahoo BOSS a lot as an undergrad, and I could tweak it to get
"better" results than Yahoo Search for certain queries. That Bing and DDG have
different results doesn't really say anything.

------
kick
Cliqz still hasn't given a good answer as to why they use what's seemingly a
typo in 'analysis' to get around user tracker blockers.

[https://anolysis.privacy.cliqz.com/](https://anolysis.privacy.cliqz.com/)

As you can see in this subthread, they claim it's "anonymized" analytics,

[https://news.ycombinator.com/item?id=21718694](https://news.ycombinator.com/item?id=21718694)

which, as 99% of research suggests, isn't anonymous at all:

[https://www.fastcompany.com/90278465/sorry-your-data-can-still-be-identified-even-its-anonymized](https://www.fastcompany.com/90278465/sorry-your-data-can-still-be-identified-even-its-anonymized)

~~~
solso
[Disclaimer: I work at Cliqz]

Hi, I read the thread and thought the answer was good enough, but it seems
that you are not yet convinced. Let me try:

1) Here is a list of publications on privacy by Cliqz (including published
scientific papers). It should have been fairly easy to find using a search
engine :-)
[https://0x65.dev/pages/dissemination-cliqz.html](https://0x65.dev/pages/dissemination-cliqz.html)
Hopefully, the papers will convince you that Cliqz's privacy commitment is
serious.

2) Feel free to monitor your own traffic to see whether or not we are tracking
you.

3) Honestly, if someone tells you that anolysis means anonymous + analysis,
why do you not believe it? It does not take long to find references to the
name in the source code. On a separate note, as a company (Cliqz) that offers
anti-tracking and ad-blocking, I can tell you that blocklists are a bit more
sophisticated than that.

Hope this addresses your concerns,

[comment edited: why do you _not_ believe it?]

~~~
kick
A quick skim of those PDFs found no mention of "anolysis." Your colleague
claimed that papers were going to be released "soon" on it. It's been at least
two years since you started using it, so why hasn't there been one yet?

 _It does not take long to find references of the name on the source code._

No references in any of your published source code, and your search engine
isn't free software:

[https://github.com/search?q=org%3Acliqz-oss+anolysis](https://github.com/search?q=org%3Acliqz-oss+anolysis)

 _Honestly, if someone tells you that anolysis means anonymous + analysis, why
do you not believe it?_

Cliqz has done unsavory things in the past (like the Firefox fiasco a few
years back, for example, which I can't fault Cliqz entirely for: Mozilla is
just as guilty).

 _On a separate note, as a company (Cliqz) that offers anti-tracking and
ad-blocking, I can tell you that blocklists are a bit more sophisticated than
that._

"anolysis" gets around both uBlock Origin and uMatrix, despite both of them
automatically blacklisting any URL with "analytics" in it, as an example.
Getting around the most popular content filterers on the internet is a pretty
strong signal.
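For illustration, many filter-list rules boil down to substring matches against the request URL, which is why a one-letter change slips through. This is a simplified sketch, not uBlock Origin's actual matching engine:

```python
# Simplified illustration (NOT uBlock Origin's real rule engine): many
# filter-list entries act as substring matches against the request URL.
BLOCKED_SUBSTRINGS = ["analytics", "adserver", "tracker"]

def is_blocked(url: str) -> bool:
    """Return True if any blocklist substring occurs in the URL."""
    return any(s in url for s in BLOCKED_SUBSTRINGS)

print(is_blocked("https://analytics.example.com/collect"))  # True
print(is_blocked("https://anolysis.privacy.cliqz.com/"))    # False
```

A rule keyed on the literal string "analytics" never fires on "anolysis", regardless of intent.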

~~~
pythux
> No references in any of your published source code, and your search engine
> isn't free software:

You looked in the wrong tab; check under "Code" to find code related to
Anolysis:
[https://github.com/search?q=org%3Acliqz-oss+anolysis&type=Code](https://github.com/search?q=org%3Acliqz-oss+anolysis&type=Code)

> Cliqz has done unsavory things in the past (like the Firefox fiasco a few
> years back, for example, which I can't fault Cliqz entirely for: Mozilla is
> just as guilty).

Not sure how this is Cliqz' fuck-up. We are not hiding anything. On the
contrary we are very transparent and detailed about how everything we do is
designed to not track users. All of this is on our new tech blog:
[https://0x65.dev](https://0x65.dev), feel free to have a look between two
comments on HN and give us some feedback!

> "anolysis" gets around both uBlock Origin and uMatrix, despite both of them
> automatically blacklisting any URL with "analytics" in it, as an example.
> Getting around the most popular content filterers on the internet is a
> pretty strong signal.

It's not called "getting around it" when there is no tracking or ads going on.
(If you want to see how smart the "most popular content filterers" are, check
out this link and see that the image is blocked because its URL contains the
substring "analytics":
[https://whotracks.me/blog/private_analytics.html](https://whotracks.me/blog/private_analytics.html).
Wicked smart!)

Anolysis is not a typo, it's a project name, people tend to do that when they
care and spend a lot of time on projects: give them names. So, at the risk of
repeating myself, Anolysis = Analysis + Anonymous (at the time we thought it
was a pretty neat name!).

Anolysis does not operate outside of Cliqz products (no website analytics
here, and we do not rely on a third party; we built it in-house for this
reason). We put a lot of work into making sure it does not use a unique ID
(unlike virtually every other analytics system out there) and is by design
unable to track any single user (in fact, the system does not even have the
concept of a user). Sure, we have not written extensively about it, but we
have to start somewhere (in December we are writing on 24 different things we
do; we will be sure to consider Anolysis as a good candidate for a future
technical blog post).

What you attribute to malice is simply a lack of time; as you have probably
noticed, Cliqz is working on solving _a lot_ of very hard problems (search,
browsers, antitracking, adblocking, privacy-preserving telemetry and so much
more), and writing a paper about the new system you designed and implemented
is not always the priority :)

------
nostromo
I don’t see how this will ultimately be successful.

They’re basically reverse engineering Google by looking at user logs.

Google will always have a leg up here because they have all the Google data.

And even if it does work for a while, there still needs to be the original
signal to copy. Someone will have to crawl the web and index content.

I’m super eager to find new approaches to search, but another Google clone is
not that.

~~~
commoner
The article describes techniques used by all search engines, not just Google.
Search engines existed before Google, and despite Google's monopoly on search,
"Google clone" is a poor term for all search engines when alternatives with
unique features (e.g. DuckDuckGo) exist.

~~~
nostromo
They are literally rebuilding Google by tracking how users use Google and
rebuilding the SERPs.

If that’s not a Google clone I don’t know what is.

~~~
wizzwizz4
All search engines have search-engine results pages.

They're looking at how users use Google Search because the data's there.
They're making a competitor to Google Search. That doesn't mean they're
rebuilding Google Search's SERPs, or making a Google Search “clone”; I've got
results from Cliqz for queries I'm confident have never been put into Google
before, meaning it's functioning as an independent search engine.

~~~
pheug
>They're looking at how users use Google Search because the data's there

This. Having worked in a past life for one of their competitors, I can confirm:
what users click on (in the SERP) is one of the most powerful signals for
ranking. And who got (almost) all the clicks in the world? Google!

That's why it's so damn hard to beat them. It's the unreasonable effectiveness
of data: more data (which they have almost all of) usually beats a smarter
algorithm, and with 20 years R&D, theirs is surely not dumb.

Whether the clicks belong to the users or to Google is an interesting
question, though.
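As a toy illustration of that click signal (hypothetical data; real systems also correct for position bias and much more):

```python
from collections import defaultdict

# Toy click log: (query, url, clicked). Aggregating raw click-through
# rate per (query, url) pair yields a simple ranking signal.
log = [
    ("python json", "docs.python.org/json", True),
    ("python json", "docs.python.org/json", True),
    ("python json", "example.com/json-tips", False),
    ("python json", "example.com/json-tips", True),
]

shown = defaultdict(int)
clicked = defaultdict(int)
for query, url, was_clicked in log:
    shown[(query, url)] += 1
    clicked[(query, url)] += was_clicked

def rank(query):
    """Rank URLs seen for a query by observed click-through rate."""
    urls = {u for q, u in shown if q == query}
    return sorted(urls,
                  key=lambda u: clicked[(query, u)] / shown[(query, u)],
                  reverse=True)

print(rank("python json"))  # docs.python.org/json first (CTR 1.0 vs 0.5)
```

Whoever sees the most (query, click) pairs gets the best estimates, which is exactly the data advantage described above.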

------
josephpmay
I'd never heard of Cliqz before, but I just did a couple of test searches, and
I'm honestly super impressed with the results. The result relevance seemed to
be closer to Google than to DDG/Bing.

~~~
big_chungus
Maybe so, but honestly, the name is putting me off more than anything. It just
doesn't sound professional, and causes me to perceive it as shady, even if
it's not. The name also makes it sound like it's more about marketing "clicks"
to advertisers than providing good results. None of that is necessarily true,
but it's the impression the name gives. It needs to re-brand.

~~~
tschakkaMarc
It’s the one time where I can honestly say: the discussions we’re having
within Cliqz about our brand name are even more heated and controversial than
here on Hacker News (and Reddit, for that matter) ... but then there is this
saying: “Every brand name is shit until you surpass one billion users - then
it becomes brilliant.” More seriously: we do think about it a lot - happy to
get ideas.

~~~
liability
How are there even two sides to the discussion internally? I'd sooner cut off
my left nut with a spoon than trust any company named 'Cliqz.'

You might as well name yourself _"Nigerian Princes Inc."_

~~~
NicoJuicy
So what you are saying is that you trust Google more than a German/European
company, where the laws are way more strict.

Enjoy 1 nut.

( Extreme comment fyi )

~~~
liability
I don't have to choose between the two. I trust neither.

------
pcmaffey
I really just wish exact match search still worked. But now, words are all
vectorized as every search engine tries to determine "my intent", resulting in
a deluge of fuzzy matches.

Maybe I'm old school, but I don't want software that fixes my spelling
mistakes. I want software that fails when I make a mistake.

------
LeftHandPath
Honestly, what would be interesting is if there was an open source database of
crawled webpages, available for anyone to search / use with their own
algorithms. That would make it possible for... a lot of things, really.

I feel like the web parsing / indexing, perhaps rather than the search
algorithm itself, is the hardest part of rolling a new search engine (largely
due to the associated computing and storage costs).

~~~
jkaptur
There is [https://commoncrawl.org/](https://commoncrawl.org/), but it would be
really cool if there were a more well-lit path towards building the rest of a
simple search engine. For example, another commentator wanted “like Google,
but without the spelling correction”, well, spin up one of these and just stub
out the spelling module :)

~~~
LeftHandPath
Thanks for the tip! I'll have to play around with that sometime.

------
inertiatic
Interesting approach, certainly very different from what I'm used to as
someone working on search.

Will probably keep an eye on this blog.

------
lefstathiou
If you’re listening Cliqz, I (and I believe others) would pay $50-100 a month
for a search engine that gave me the ability to blacklist sites. I think this
alone could solve the biggest problem I have with google which is the
reversion to the (what appears to be regressive) mean of the internet today.

~~~
ma2rten
There is a browser extension called Personal Blocklist for this purpose.

------
mda
One immediate observation: non-English porn filters / safe search do not work
properly. I see NSFW results mixed in with others for benign words.

------
CamperBob2
How do search engines stay out of trouble with things like copyright trolls
and FBI-operated honeypots for child porn? If I had a private Web crawler, I'd
be terrified to run it.

------
citilife
Also having built a search engine from scratch, we use a similar method:
[https://insideropinion.com/](https://insideropinion.com/)

In our case the "queries" are also the index creation components. Every time
someone discusses something, we are indexing it, so you can search media,
documents, people from context. We hint at how this works here:
[https://austingwalters.com/fast-full-text-search-in-postgresql/](https://austingwalters.com/fast-full-text-search-in-postgresql/)

The downside of our approach is that it needs lots of conversation data. From
their TL;DR version:

"""

- Our model of a web page is based on queries only. These queries could
either be observed in the query logs or could be synthetic, i.e. we generate
them. In other words, during the recall phase, we do not try to match query
words directly with the content of the page. This is a crucial differentiating
factor – it is the reason we are able to build a search engine with
dramatically less resources in comparison to our competitors.

- Given a query, we first look for similar queries using a multitude of
keyword and word vector based matching techniques.

- We pick the most similar queries and fetch the pages associated with them.

- At this point, we start considering the content of the page. We utilize it
for feature extraction during ranking, filtering and dynamic snippet
generation.

"""

It appears 0x65 has similarly figured this out; the name of the game is
forming proper search queries. In their case, their results would be good as
soon as they start indexing and creating synthetic queries. IMO it might be
better for documents and whatnot.

Either way, interesting to compare notes! Kudos to the work.

~~~
aldoushuxley001
I remember reading your article on FTS in postgres, great stuff. Was wondering
what strategies you might be using to perform counts on your data?

I'm trying to implement a faceted search in postgres and currently using
window functions to count subcategories (a la
[http://akorotkov.github.io/blog/2016/06/17/faceted-search/](http://akorotkov.github.io/blog/2016/06/17/faceted-search/)), but not
sure if it's the most efficient.

~~~
citilife
Depends on what you want to do. I created an "estimate_count" function to make
it much, much faster:

"SELECT planrows FROM estimate_count('SELECT COUNT(*) FROM table WHERE XXX')"

~~~
aldoushuxley001
That's actually brilliant.

If you're ever looking for something to write about for a new blog post, I
would love to learn more about how you implemented that estimate_count
function.

Thanks for the tip in the right direction tho!

------
throwaway8879
We've had at least 2 posts from cliqz in the past few days. I have genuine
issues with my short-term memory after having recovered from a coma so I don't
know whether this is a glitch where I keep seeing the same posts or they keep
getting reposted.

~~~
dang
Your short-term memory is fine in this case, and I hope that is part of a
complete recovery.

They're doing an "Advent Calendar" series where they're posting one a day:

[https://news.ycombinator.com/item?id=21676252](https://news.ycombinator.com/item?id=21676252)

[https://news.ycombinator.com/item?id=21684708](https://news.ycombinator.com/item?id=21684708)

[https://news.ycombinator.com/item?id=21694980](https://news.ycombinator.com/item?id=21694980)

and
[https://news.ycombinator.com/item?id=21716860](https://news.ycombinator.com/item?id=21716860)
(not even a day ago).

This is a problem for HN because users here are not used to this sort of
repetition—indeed, we moderate HN explicitly to dampen repetition, because the
point of the site is curiosity and curiosity withers under it
([https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...](https://hn.algolia.com/?dateRange=all&page=0&prefix=false&query=by%3Adang%20withers&sort=byDate&type=comment)).
The more these posts show up with what for HN is a crazy frequency, the more
likely users here are to experience it as a barrage and start to complain.

On the other hand, these articles are well-crafted, contain a lot of
information, and would normally be fine HN submissions. The topic of building
a new search engine is intrinsically interesting. It also resonates with a lot
of themes that get discussed a lot on HN (concerns about big tech and so on).
So this is a different situation than the usual marketing onslaughts that HN
gets subjected to, where the content is crappy, users flag it away, and
moderators squash what users missed.

I'm not sure what to do about this yet.

------
yurisokolov
>> "Our model of a web page is based on queries only. ... This is a crucial
differentiating factor – it is the reason we are able to build a search engine
with dramatically less resources in comparison to our competitors."

>> "The total size of our index currently is around 50 TB."

Could you share more about your current index size (number of pages, size of
raw text) to put those 50 TB into perspective, and to give an idea of how many
fewer resources you need in comparison to your competitors? This would help to
compare your approach to Elasticsearch, Solr, and Lucene.

------
3xblah
But is Cliqz a search engine, or is it a web browser with its search bar set
to a default search engine (or a Firefox extension that accomplishes the same
thing)?

Cliqz bought Ghostery to acquire a pool of privacy-conscious users. The goal
is to show them ads. I'm not sure how excited they will be about that.

If Cliqz really is a search engine, can a user submit a query to the database
using her own choice of TCP/HTTP client? It looks like submitting requires
first downloading and installing software from Cliqz.

~~~
ssblunder
[https://beta.cliqz.com](https://beta.cliqz.com)

~~~
tschakkaMarc
Or via Tor browser as Onion service: search4tor7txuze.onion (quite unique - if
available it will show onion services instead of www domains).

------
sterlind
Do you use Common Crawl? It seems like a pretty big corpus, reasonably
up-to-date and free, and a good way to supplement the page data from the
browser extension.

~~~
sammacbeth_
From the article:

"It may seem like Common Crawl would suffice for this purpose, but it has poor
coverage outside of the US and its update frequency is not realistic for use
in a search engine."

------
caro_douglos
50TB on localhost is awesome! I wonder what the amount of new content created
was between 2018 and 2019.

~~~
StillBored
But I doubt you really need it all in RAM for a small, localized, single-user
search engine. In that case you back most of the index with NVMe, which at the
prices I saw on Black Friday is probably less than $10k using consumer-grade
QLC flash drives combined with cheap x1 PCIe expansion boards/etc.

The 10 PB of disk is also quite reachable, given it's possible to buy bulk
10 TB disks at $150 each.
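A quick back-of-the-envelope check of those numbers (list prices only; redundancy, enclosures, and power not included):

```python
# Rough cost of 10 PB of raw spinning disk at Black Friday prices.
tb_per_drive = 10        # bulk consumer drives
price_per_drive = 150    # USD each

petabytes_needed = 10
drives = petabytes_needed * 1000 // tb_per_drive  # 1000 drives
cost = drives * price_per_drive                   # 150000 USD

print(drives, cost)
```

So raw capacity on the order of the whole crawl is a six-figure problem, not a nine-figure one, which is what makes a narrow, topic-based engine plausible on a small budget.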

Bottom line: I did some of these calculations a couple of years ago because I
was interested in a topic-based search engine that indexed only certain topics
and basically tossed any crawler results that didn't appear to fit the subject
matter.

So, while the web is a lot bigger than when Google started, storage and
compute are also a lot cheaper. A web search engine that specialized in, say,
cooking recipes might be entirely doable on a fairly limited budget.

------
sansnomme
Can someone comment on how to use knowledge graphs for search? I have seen
some applications in NLP but I am curious how it can tie in with traditional
search.

~~~
vl
AFAIK, in Google’s case the instant answer cards come from Google’s knowledge
graph, not the search results. I.e., if you see some info rendered on top of
the search results or on the side, it’s most likely from the knowledge graph.
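As a toy illustration of the idea (made-up data, not Google's actual system): a knowledge graph is a store of entities and facts that is consulted separately from the web index, and a hit renders a card above the ordinary results.

```python
# Tiny stand-in for a knowledge graph: entities mapped to structured facts.
knowledge_graph = {
    "eiffel tower": {"type": "monument", "height_m": 330, "city": "Paris"},
}

def answer_card(query: str):
    """Return a fact card if the query names a known entity, else None."""
    entity = knowledge_graph.get(query.lower())
    if entity is None:
        return None  # fall back to ordinary web results
    return (f"{query.title()}: {entity['type']} in {entity['city']}, "
            f"{entity['height_m']} m tall")

print(answer_card("Eiffel Tower"))
print(answer_card("some obscure query"))  # None
```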

------
pbreit
Definitely interesting.

But...my searching has gotten to the point where well over half the time I am
no longer looking at Google's conventional search results.

~~~
kbyatnal
could you explain that a bit more? What are you using instead?

~~~
pbreit
Just all of Google's "magic box" results. And a lot of searching is in the
Chrome/Firefox/Safari address/search box.

------
jayess
Strangely, cliqz.com was blocked by my pihole.

~~~
kkm
Hi,

Thank you for bringing this up. Although this is not relevant in the context
of the blog post: we are on that list by mistake. We do NOT collect any
personal data in our browser (more details, e.g., here:
[https://0x65.dev/blog/2019-12-02/is-data-collection-evil.html](https://0x65.dev/blog/2019-12-02/is-data-collection-evil.html)
and
[https://0x65.dev/blog/2019-12-03/human-web-collecting-data-in-a-socially-responsible-manner.html](https://0x65.dev/blog/2019-12-03/human-web-collecting-data-in-a-socially-responsible-manner.html)),
and we go a long way to make sure not even implicit identifiers go through. We
believe we ended up on that list because of a bad Firefox experiment, and we
will reach out to the maintainers and make our case.

Disclaimer: I work for Cliqz.

------
based2
[https://github.com/Qwant/](https://github.com/Qwant/)

------
debt
Google's suggested autocorrect is one of its most impressive features; I'd say
the relevance of the search results comes in a near second to that.

So make a competitive "suggested autocorrect" solution and then I think you'd
have a stew going.

~~~
trisch
There is an autocorrect feature and it's even described in the article. Or you
mean something else?

~~~
ma2rten
They mean auto complete.

~~~
trisch
Here is the quote from the article: "This not only involves some
normalization, but also expansions and spell corrections, if necessary."
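As an aside, a minimal Norvig-style sketch shows what query spell correction looks like in principle (toy vocabulary; not Cliqz's actual method):

```python
# Generate edit-distance-1 candidates and keep those found in a vocabulary.
VOCAB = {"search", "engine", "privacy", "browser"}

def edits1(word: str) -> set:
    """All strings one edit (delete/insert/replace/transpose) away."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    inserts = [l + c + r for l, r in splits for c in letters]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    return set(deletes + inserts + replaces + transposes)

def correct(word: str) -> str:
    """Return the word itself if known, else an in-vocabulary edit."""
    if word in VOCAB:
        return word
    candidates = edits1(word) & VOCAB
    return min(candidates) if candidates else word

print(correct("serch"))   # "search"
print(correct("engine"))  # unchanged
```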

------
rawoke083600
Might be worth putting a link to the actual search engine in the article. Good article, though!

~~~
trisch
[https://beta.cliqz.com/](https://beta.cliqz.com/) is the entry point for the search engine.

------
lasthacker
Is the Human Web dataset available for download?

~~~
netankit
A data release is not possible, but if people want to come and do experiments
on the data or try to test it for privacy, we are more than happy to host
them. There is no formal process; it's done on a best-effort basis, as we have
done several times in the past. If you are very interested, contact us and we
will see if we can accommodate you.

[Disclaimer: I work at Cliqz]

~~~
philippclassen
(Disclaimer: I work at Cliqz) Extending on that, let me elaborate on why we
cannot open the data, not even a subset of it. We have had this discussion in
the past, but for two reasons it is not an option.

Although it is anonymous data - currently we are not aware of any
de-anonymization attacks - it is still data that came from real persons. We
have a responsibility: once the data is out, we have to guarantee that no one
will ever be able to identify a single person in the data. Take into account
also that attackers can combine multiple data sets (background knowledge
attacks); that even includes data sets that will be published (or leaked) in
the future.

You should never be too confident when it comes to security, nor should you
underestimate the creativity of attackers. What we can do - and did in the
past - is simulate the scenario in a controlled environment by hiring
pen-testing companies. If they find an attack, they will not use that
knowledge to harm the persons behind the identities they could reveal.

That is the main reason. We don't want to end up in a situation like AOL or
Netflix did when they published their data. By the way, Netflix is an example
of a background knowledge attack, where the attackers needed to combine data
sources.

There is also another argument. Skeptics will most likely remain skeptics, as
we cannot prove that we did not filter out data before publishing. In other
words, there is nothing for us to gain; we can only lose. Trust is important,
but for building trust it is better to be transparent about the data that gets
sent by the client. You can verify that part yourself and do not have to rely
on trust alone. That is the core idea behind our privacy-by-design approach.

Those are the arguments I'm aware of for why we will not open the data.
However, getting access in controlled environments is possible. If you are
doing security/privacy research, you can reach out to us. In my opinion,
having more people trying to find flaws in our heuristics is useful. That
gives us a chance to fix them before they can be used for attacks.

One notable exception: [https://whotracks.me](https://whotracks.me) is built
from Human Web, and all its underlying data can be freely downloaded. We know
that it has already been used for research.

