A New Search Engine (0x65.dev)
173 points by chrmod on Dec 5, 2019 | 102 comments

They talk about using query logs to optimize their search results:

>Queries performed by people, if associated to a web page, serve as even cleaner summaries than anchor text. This is because all the logic put in place by the search engine, who resolved the query with a list of web pages, and all human understanding and experience that led one to select the best page from the offered result list end up embedded in the association <query, url>.

This would seem to present a "rich get richer" problem where the oldest links that have the largest click-through tend to float to the top making it difficult for a new result that may be "better" to appear high in the search results. Anyone know how search engines tackle this problem?

If I had a big enough user base, I would give some users non-optimized queries and check whether their choices still matched previous choices. If they did, I would increase the rank of the previous choices; if not, I would start to deprioritize them and promote the results that were chosen instead. An ongoing test of: is result X for search term Y still the best?

Going strictly by volume of clicks (or volume of clicks for a keyword) is not going to be fair or particularly useful; you need to compare clicks received to clicks expected.

If it's shown in position one, it should be expected to get more clicks than position ten. If something gets more than expected, move it up.
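A minimal sketch of that clicks-received versus clicks-expected comparison, in Python. The per-position baseline rates in `EXPECTED_CTR` are made-up illustrative numbers, and `clicks_over_expected` is a hypothetical helper, not something any real engine is known to use.

```python
# Hypothetical baseline click-through rate per result position (1-indexed).
# These numbers are illustrative assumptions, not measured data.
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05}


def clicks_over_expected(impressions_by_position, clicks):
    """Ratio of observed clicks to the clicks expected from position alone.

    impressions_by_position maps position -> number of times the result
    was shown there. A ratio above 1.0 suggests the result outperforms
    its slot and could move up; below 1.0 suggests the opposite.
    """
    expected = sum(
        EXPECTED_CTR.get(pos, 0.0) * shown
        for pos, shown in impressions_by_position.items()
    )
    if expected == 0:
        return None  # no basis for comparison
    return clicks / expected


# A result shown 1000 times at position 5 that drew 80 clicks
# is compared against the 50 clicks expected for that slot.
ratio = clicks_over_expected({5: 1000}, 80)
```

Normalizing by position like this is what keeps a long-standing top result from winning purely because it sits in the slot that gets the most clicks anyway.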

[Disclaimer, I work at Cliqz]

Your point is spot on. Old pages tend to have more association to seen queries, which does not play in favor for new pages.

That said, there are a couple of things to consider: 1) seen queries are not the only source of queries; we are pretty good at creating synthetic queries based on the content, descriptions, etc. These queries are noisier than the seen queries, of course, but good enough. And 2) novelty, freshness and popularity are very important features in the ranking. Feel free to try out any new topic you might think of on https://beta.cliqz.com; you will see that it is not only "stale" content.

Thank you for the additional detail and this certainly appears to be a challenging problem.

It is still a little fuzzy to me. What is a "synthetic query"? Is this basically generating queries that would match the content (i.e. essentially reversing the process)?

Novelty, freshness are interesting but can lead back to the noise problem mentioned in the blog. If many pages are created that may match the query (e.g. "best new movies") many young pages will match this. Popularity would be useful but difficult to establish and then there's the clickbait and other gaming problems.

Why does Cliqz use an analytics domain with a typo in it to get around user tracker-blockers? That's incredibly scummy, given how much Cliqz has been shouting about privacy.


Anolysis stands for Ano[nymized] [Ana]lysis. It's a new approach to telemetry that does not send unique identifiers (like most analytics / telemetry systems do) but focuses on goal attainment at a group level. This makes the work harder, of course, but it's a price we've been willing to pay. It is a pity you would take a domain name as evidence of malice. We should have a paper coming up at some point on the approach.

Self-claimed anonymous analytics have repeatedly failed in the past: you've been using this for at least two years, why didn't Cliqz release a paper before then on it? Or even explain what it is? Or a mention on your site of it? I don't want to dislike Cliqz, what it says it's doing is cool. However, given the financial incentives involved, the verify step of "trust but verify" is more essential than ever.

Seems like an excellent opportunity to apply Hanlon's Razor.

"Never attribute to malice that which can be adequately explained by stupidity."

They've been doing it for two years at minimum:


Two years and three domains with the typo? It's malice.

This malice metric is confusing. How malicious is Google then? They misspelled "googol" for over two decades, but there's only one domain. Do we count the 1e100.net as a misspelling?

How does using an odd spelling thwart user-tracker blockers?

Many ad blockers have baseline filters that block subdomains, URLs, etc. with common tracking terms in them (ex. "telemetry" or "tracking").
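To make the mechanism concrete, here is a rough Python sketch of a keyword-style baseline filter. The term list, the `blocked_by_keyword` helper and the example.com hostnames are all hypothetical; real filter lists are far larger and use their own rule syntax.

```python
from urllib.parse import urlsplit

# Hypothetical baseline terms, chosen only for illustration.
TRACKING_TERMS = ("telemetry", "tracking", "analytics")


def blocked_by_keyword(url):
    """True if the host or path contains one of the baseline terms."""
    parts = urlsplit(url.lower())
    haystack = parts.netloc + parts.path
    return any(term in haystack for term in TRACKING_TERMS)


# The first URL matches the "telemetry" term; the second, with a
# one-letter spelling change, matches none of the terms.
matches_term = blocked_by_keyword("https://telemetry.example.com/v1")
misses_term = blocked_by_keyword("https://telemetri.example.com/v1")
```

A substring match like this only catches the exact spellings on the list, which is why an unusual spelling is not caught until someone adds an explicit rule for that domain.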

Wouldn't a multi-armed bandit help alleviate this issue? (Basically, randomly display a few other links, and use bayesian stats to figure out if the new links are more optimal).

That, or any kind of exploration/optimization algorithm, to be honest.
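A small sketch of the bandit idea in Python, using Beta-Bernoulli Thompson sampling. This is one textbook exploration strategy picked for illustration, not a description of what any search engine actually runs; `ThompsonRanker` and its method names are made up.

```python
import random


class ThompsonRanker:
    """Beta-Bernoulli Thompson sampling over candidate results for one query."""

    def __init__(self, urls, seed=None):
        # One [clicks, skips] pair per URL; with the +1 terms below this
        # corresponds to a uniform Beta(1, 1) prior.
        self.stats = {url: [0, 0] for url in urls}
        self.rng = random.Random(seed)

    def pick(self):
        # Sample a plausible click rate for each URL and show the best sample.
        # URLs with little data have wide posteriors, so they still get
        # shown occasionally.
        samples = {
            url: self.rng.betavariate(clicks + 1, skips + 1)
            for url, (clicks, skips) in self.stats.items()
        }
        return max(samples, key=samples.get)

    def update(self, url, clicked):
        # Record whether the shown URL was clicked (index 0) or skipped (1).
        self.stats[url][0 if clicked else 1] += 1
```

Usage would be a loop of `url = ranker.pick()`, show it, then `ranker.update(url, clicked)`. A new page that really is better accumulates clicks and gets picked more often, which is the counterweight to the "rich get richer" effect.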

Surfacing new content in search engines is a very challenging problem. I am guessing they use a combination of social signals (twitter, facebook) popularity and domain popularity amongst other signals.

Google pretty much knows (or can accurately estimate) exactly when a new document appears on the web and how many people are visiting it, they don't even need to rely on second hand social signals for this. They control the web's dominant crawler (Googlebot), browser (Chrome, which sends everything you type in the address bar to them by default), ads (Adsense) and tracking (Google Analytics) platforms.

[Disclaimer: I work at Cliqz]

You are on point. Recency is a challenging problem in multiple ways for search engines. It is not just limited to discovering new content, but also: how does one index it? How does one balance things out during ranking when, for the same query, you have "very new", "new", "slightly old" and "really old" results? This involves both news and new webpages surfacing on the web.

On top of this, we have to remember that this is a fully autonomous real time system which requires solving some of the most difficult engineering challenges at scale and at the same time being mindful of the latency and quality constraints.

At the end of the day, it's all about the final user experience that we ship. We are very much mindful of the same. We will be publishing more details about Cliqz search, on our blog https://0x65.dev/ in the coming days, so stay tuned.

GPS tracking must also help Google a lot to determine brick-and-mortar shop popularity.

Yes, that's an interesting problem to solve. Maybe some A/B testing and showing new results to a percentage of all searches.

> Money and Resources : We have been lucky enough to have fantastic investors, who fund and help us in our journey.

Just as an FYI, this company Cliqz is owned by Hubert Burda Media, a large media conglomerate based in Germany.

That doesn't necessarily inherently mean anything negative, but it's important to understand the potential underlying incentives given their marketing as such a strongly privacy oriented service.

An excerpt from the 1st post of this series: "Why would a team be motivated to build another search engine? Why would Hubert Burda Media finance this over several years (they continued to back us especially in times when things got tough)?" https://0x65.dev/blog/2019-12-01/the-world-needs-cliqz-the-w...

Yes, they mention it, which is a good move regarding transparency, but they still don't answer the question as to why they finance them.

They mostly push a narrative of privacy and censorship, when in the end the answer is probably close to "we want a piece of the pie" or "we want to be that monopoly".

[Disclaimer, I work at Cliqz] I cannot answer for the "true" motivation of the investors, but their pitch and actions so far are well aligned with the fight-against-monopolies narrative. Do they want to get a return on investment (eventually)? I would assume so, and I believe it would be fair. I do not see them as mutually exclusive. Of course, this is my personal opinion.

It’s a good investment as a hedge in the case the EU regulators kick Google in the ass hard enough.

The funny thing is that Qwant is currently challenged on its ability to monetize its search engine. Short answer (for the moment): it can't. https://www.lemonde.fr/economie/article/2019/12/04/le-google... (in French)

Isn’t Google basically an ad company (if you go by their revenue)? A search engine provided by a media company or an ad company: I’m curious what people think the debate is between these.

Differing biases.

Choice is good.

I don't know how many of you tried the engine, but there are 2 features that instantly took my attention:

1) Trackers Stats. Essentially, you can see how many and what trackers there are on the page you are about to visit. Before visiting it.

2) Page previews (I'm not sure about whether I like that)

> 1) Trackers Stats.

This feature is powered by another project we run, where we measure the tracking landscape in the web (most popular domains): https://whotracks.me. Details on how that works can be found in our paper [0]. Also - we are flirting with the idea of providing a mode where the ranking is informed by the trackers in the destination site. Would love to hear your thoughts on whether you'd like smth like this.

> 2) Page previews (I'm not sure about whether I like that)

At the moment it's only a placeholder for a lengthier title and description (if available), but we are planning to use the space for rendering a short summary of the content/media in that site + similar sites in terms of content (query-relevant of course). This is more work in progress as we want to make sure content creators are on board. Again: would love to hear your thoughts on that.

Disclaimer: I work at Cliqz.

[0] WhoTracks .Me: Shedding light on the opaque world of online tracking - https://arxiv.org/abs/1804.08959

> we are flirting with the idea of providing a mode where the ranking is informed by the trackers in the destination site. Would love to hear your thoughts on whether you'd like smth like this.

That's definitely the right way to go. I would also very much appreciate an option not to show in the result list sites/pages having any trackers.

I'm afraid this will remove any results from page :-D

Is it really that bad? I surely hope it's not.

I noticed that even Wikipedia is reported as having some trackers. But when I looked closer I noticed that most of those belong to the Wikimedia foundation, which is fine. I mean, I don't mind site owners tracking what I do on their site, I just don't want to be followed across the whole Web.

The rest of the Wikipedia trackers are supposed to be Google fonts and statics, but I couldn't witness any calls to those. Maybe the stats are not quite up to date?

If such a score is to be given, it better be fair and reflecting the current state of affairs.

Hi, I work at Cliqz on our Anti-tracking system, and the WhoTracks.Me data that powers these stats on the search page.

These stats are updated monthly, and based on millions of loads of each site. The WhoTracks.Me page for wikipedia.org (https://whotracks.me/websites/wikipedia.org.html) shows that the Google Fonts and Google Static trackers occur very infrequently (<2% of pages), so may be on some part of the site that you did not visit.

While the Wikimedia tracker may seem innocuous, they do set a cookie that is sent in third-party contexts, and have presence across several sites beyond Wikipedia (133 of the top 10k) (https://whotracks.me/trackers/wikimedia.org.html). Theoretically, they could track user sessions across these sites. In reality this is likely an oversight in the server configuration, but objectively this profile looks no different to that of a legitimate tracker.

Thanks for the explanation. It makes sense now. And that's ... depressing.

What I really want is not another search engine for contextless queries. Except for really basic queries (which Google/etc already do a good job at), I'm trying to answer a question, perhaps open-ended, and it will take multiple queries to resolve. And it's not a linear process of narrowing down with + or - keywords. It's establishing a context: I'm searching for something relevant to "go" the language, not "go" the English verb or "go" the board game.

I want to be able to open a jupyter-like notebook with the start of my search query, and the first step should be to show me the available eigencontexts, from which I can establish the gross context for my entire search. After this first click, none of the results should be about the board game or the English word--unless the relevant search results happen to include an implementation of Go the board game in Go the language.

And then when I'm done, I want to name and archive that notebook so I can return to it at a later date--whether to refresh my memory of the ultimate answer, or to continue the search.

I guess I would call this a 'research engine' instead of a search engine.

[Disclaimer: I work at Cliqz]

I hate to answer this one, because it looks too much like marketing-speak, but this feature exists. Not on beta.cliqz.com, but in the drop-down search in the Cliqz browser.

Based on the tabs you have open, different query expansions are selected. For instance, typing "hotel in ma..." would probably show you results for Mallorca, but if you have "Madrid" open in a tab, then it will show results for "hotel in madrid".

There will be a blog post about this contextual search because it's our showcase that it is possible to do personalization without compromising privacy. All this is done privately: the browser receives results for multiple expansions and can choose which one to display based on local information. We never track or collect sessions of users.
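A rough Python sketch of how such client-side selection could work. This is only an illustration under my own assumptions (word overlap with open tab titles as the local signal); the `choose_expansion` function and its data shapes are hypothetical and not taken from the Cliqz browser.

```python
def choose_expansion(expansions, open_tab_titles):
    """Pick one query expansion using only data that stays on the client.

    expansions: non-empty list of (expanded_query, results) pairs returned
    by the server for the same typed prefix, in the server's default order.
    open_tab_titles: titles of locally open tabs; never sent anywhere.
    """
    local_words = {
        word.lower() for title in open_tab_titles for word in title.split()
    }

    def overlap(expanded_query):
        # Count how many words of the expansion also appear in a tab title.
        return sum(word.lower() in local_words for word in expanded_query.split())

    # max() returns the first maximal item, so ties keep the server's default.
    return max(expansions, key=lambda pair: overlap(pair[0]))
```

With expansions for "hotel in mallorca" and "hotel in madrid" and an open tab titled "Madrid travel guide", this picks the Madrid expansion; with no tabs open it falls back to the first expansion the server sent. The server sees the same request either way, since the choice happens after the response arrives.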

Ok, but can't you then figure out what the browser was displaying using Javascript?

Not sure I get your point. But contextual search only works for search within the Cliqz browser, in the address bar dropdown, on the client side. The same approach cannot be used on the web SERP page (beta.cliqz.com), because from a web page we have no access to the open tabs. It would only be possible via tracking and user profiling, which is something that we do not do, or want to do.

They could, but then everyone would see them do it, and kind of the whole point is that they won't do it.

But who in their right mind would allow the Javascript code of one tab to access data in other tabs?

I had a very similar idea a few years ago, did some quick numbers and came to the conclusion that it was not going to fly commercially.

Interestingly, the working name for my idea was also "ResearchEngine", so I guess it summarizes pretty well the unmet need you and I have.

Agreed. I tried replicating this with org mode, which required an extensive learning curve to meet needs akin to what you listed. Although at some level it did the job, I could imagine a tool that does it better.

> I want to be able to open a jupyter-like notebook with the start of my search query, and the first step should be to show me the available eigencontexts

AltaVista used to do this. I miss it terribly.

I think you are mis-remembering this. I used Altavista heavily until Google came out and made search work, and I don't remember any feature anything like this.

AltaVista used to have a Java Applet that would draw the clusters. So "python" would get you a cluster that was "reptiles" and one that was "programming". You could then click on that cluster and it would "zoom in" on the cluster and then redraw the probability clusters again.

I believe this is called faceted search.

For the first level context, you're right. Does faceted search also allow the ability to specify "near this link"? Like a "warmer" or "colder" approach to searching, where I can train the engine during the search?

I also want to maintain a longer-term list of pages/sites to exclude. Like, unless otherwise specified, I never want results from w3schools or expertsexchange.

This field is ripe for disruption, in my opinion. We can do so much better, but I've not seen any serious attempts.

The nationalism on the homepage is a little odd... in particular since they're still essentially building on top of Google.


> Europe has failed to build its own digital infrastructure. US companies such as Google have thus been able to secure supremacy. They ruthlessly exploit our data. They skim off all profits. They impose their rules on us. In order not to become completely dependent and end up as a digital colony, we Europeans must now build our own independent infrastructure as the foundation for a sovereign future. A future in which our values apply. A future in which Europe receives a fair share of the added value. A future in which we all have sovereign control over our data and our digital lives. And this is exactly why we at Cliqz are developing digital key technologies made in Germany.

There's been a lot of discussion about how US-centric the internet is in general even discounting how many massively popular internet companies are US based. I don't think it's unreasonable for Europeans or other nations to try to be less dependent on the US and US based services. As an American, I think it's the smartest thing they can do and I welcome it.

As far as that goes, I agree.

I have an uncomfortable feeling, however that this is when the walls really start going up in the internet, beyond just the dictatorships.

I also welcome it, as an American. Hopefully it means more competition and therefore more motivation for US-based products to improve rather than stagnate.

Well, I think they're right (American here).

> The nationalism on the homepage is a little odd.

Especially seeing as how Europe is not a nation.

> A future in which we all have sovereign control over our data and our digital lives. ... And this is exactly why we at Cliqz are developing digital key technologies made in Germany.

Ahhh, that makes more sense, "sovereign" == "made in Germany."

Ah, yes. The old "hiybbprqag" method. Worked great for Bing.


"Hiybbprqag?" How Google Tripped Up Microsoft


And they even link to it on the notes of the blog post.

I like Cliqz and am already impressed with the results for several of my test queries, though they certainly have a ways to go.

But what I would really like to see, as has been mentioned in other threads, is an open-source or community-funded search engine. Something that "belongs to the web" itself, so to speak, and not to any particular corporation.

Like yacy? https://yacy.net/

Tried many times to use it, but since it was a resource-hungry Java application that required me to use the web through an HTTP proxy to contribute, it wasn't really usable for me at any time. Also, the search results were mostly garbage for me.

If you are going to make a new search engine, you need to attack a problem that people have, like Duckduckgo solving privacy issues. I don't want to install something that collects a bunch of personal info about me. A better idea is to search bookmarked sites and the cache. And do it locally.

To be fair, Mojeek addressed the search engine privacy problem long before DDG existed.

Wow, just wow. I've been working with Odoo for a couple of years now and it's been a frustrating experience because its documentation sucks badly and it's really freakin hard to get relevant answers from DuckDuckGo or Google when stuck. I tried out a search now on Cliqz and can't believe how good and relevant the result was. Could be a lucky shot, but I'm definitely gonna try this out more. Great work guys! :)

Anybody know Cliqz's database stack? Curious to see what powers a large scale information retrieval index of this sort.

[Disclaimer: I work at Cliqz]

We will have a blog post tomorrow on this very topic, but in short, we use a combination of Keyvi, Granne (both in-house) along with Cassandra and RocksDB.

Though our approach mentioned in this blogpost significantly reduces the storage needed to host the index, we still have an index of around 50 TB of data.

[Disclaimer: I work at Cliqz] There are a lot of systems under the hood, depending on whether it's the main index or the freshness index. But if I had to pick one as the database, it would be Keyvi (https://github.com/KeyviDev/keyvi).

FYI, there will be a bunch of articles regarding search in the next week.

There was another one of these posts a week or so ago. In that one I was one of many that complained the search engine was unusable without javascript enabled.

Now you can search without javascript enabled. Thanks, cliqz devs.

It's still not perfect, but should be usable. Many thanks for the feedback :))

Do you (founders/employees) have any example queries that don’t work well with Google, and correspondingly what pages you think should be top ranked for those queries?

'sex chat' currently, from my location, 'free chat now' has 2 of the top 3 results. 'i sexy chat' has 2 of the top ten results.

'chat i w ' is number 13, and has been top 10 for much of past couple / few years.. yet they are 'not an adult site' since they run GGL ads...

should be top 5 again.. sexchatsexchat.com has way more content and history..

there are many more sites I could suggest that actually have chat systems running (unlike the porn dood site which is a top 20 link list)

there are many good sites that aren't even in the results at all... these are being gamed by well connected linkers, not ranked by amount of content and length of time people would stay and enjoy.

imbo - in my biased opinion, I have more to add but wonder if it does any good.

a new engine that handles adult better, I would help with.. the other sites listed here do not do these results justice either.. again imbo, ymmv.

Since the Cliqz devs are here, and this engine is based in Germany, a question: does your search engine have any mechanisms for reporting abusive URLs (doxxing, targeted harassment, revenge porn, etc) beyond right-to-be-forgotten, or are you more a laissez-faire, anything-goes kind of search company?

I noticed that your engine ranks some of the nastier sites on the internet far higher than any other search engine I've looked into.

[Disclaimer: I work at Cliqz] Yes, there is a way to report such urls https://cliqz.com/en/report-url

We do have a list of blacklisted URLs/domains, mostly regarding adult topics (child porn etc). If you have noticed some bad sites in our results, please feel free to drop a line to our support team using the link I provided.

Thanks for the reply. A bit disappointed it only counts for extremely illegal content. There's a lot of really negative stuff out there that is blatantly false and manipulative (Ripoff Report, Tumblr callout posts, etc) and it's always a shame that this kind of negative toxicity gets promoted so high in SERPs.

I'd really like it if there were an ethical SERP that at least had some integrity with its results. Reporting factual unflattering statements is one thing (and ideal), but promoting libel feels really dirty, and so far Cliqz seems to be the worst at that of any search engines I've used, and your reporting link seems as though Cliqz is okay with that.

I would like to read this but I can't reach the web server. Is it just me?

Had the same issue with another article from this same site a couple of days ago. Looks like everyone else is able to read it but for some reason not me.

Anyone know what's going on?


Interesting, could you tell us what error you see?

Other ways you can reach the blog: If you use Tor browser can you try opening: http://cliqzdevxo33b4h6.onion/

Or if you use Beaker browser: dat://ee172d7cd9235b2cf86ea9481e8a40e48cea29c743036621edc79a4765aa0281

Disclaimer: I work for Cliqz.

I get the following error:

This site can’t be reached
0x65.dev refused to connect.

Checking the connection
Checking the proxy and the firewall
ERR_CONNECTION_REFUSED

This happens on Chrome, Firefox, Safari, and Opera on my Mac.

First, let's check if you can open another domain on .dev TLD, like web.dev, if not then:

Seems like you have some mapping for .dev TLD. Assuming based on your mention of Safari, that you are using Mac.

Could you check if you have some setting in your /etc/resolver for dev TLD, or if you are using some service like dnsmasq which is trying to resolve .dev to a non-existent location.
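A small Python sketch of that check, assuming the macOS convention that each file under /etc/resolver is named after a domain and holds `nameserver` lines. The helper names `nameservers_in` and `tld_overrides` are made up for illustration.

```python
from pathlib import Path


def nameservers_in(text):
    """Return the addresses on 'nameserver' lines of a resolver file."""
    servers = []
    for line in text.splitlines():
        # Drop anything after a '#' and split the rest into fields.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def tld_overrides(resolver_dir="/etc/resolver"):
    """Map each file name in resolver_dir (e.g. 'dev') to its nameservers."""
    root = Path(resolver_dir)
    if not root.is_dir():
        return {}
    return {
        entry.name: nameservers_in(entry.read_text())
        for entry in sorted(root.iterdir())
        if entry.is_file()
    }
```

If `tld_overrides()` returns an entry for `dev`, lookups for that TLD are being handled by a per-domain override rather than the normal DNS servers, which would fit a refused connection for 0x65.dev.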

Oh that's weird, I have "nameserver" under /etc/resolver/dev

I am on mac but I didn't touch anything. Is this how mac ships by default? Or do you think some app may have created this file?

Maybe it is just you. For me, it works. Not sure why it does not load for you.

it worked for me.

> The experts, who chose to answer, suggested that we should first start with crawling the whole web. We were told that this would take between 1 and 2 years to complete, and would cost a minimum of $1 billion

Why are costs so high for crawling?

(Disclaimer: I work at Cliqz) I don't work on the search, but did some work recently on the crawling part. What I know is that crawling is far more difficult if you are not a big player. Sites will quickly block you once you hit a rate limit.

We have to be very careful, since when we get blocked there is normally no way to get unblocked again. You can try to send them an email to unblock you, but it is unlikely that you get a response. This is one part of the explanation why crawling is slow. The other part is more obvious: the internet is large.

The blocking part is hard to overcome as a small player, while for Google it is the opposite, as sites simply cannot afford being excluded from the index. If we did not have to care about rate limits, it would simplify the problem.
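To illustrate why rate limits slow a crawl down, here is a minimal Python sketch of per-host politeness scheduling. The `PoliteScheduler` class and the 5-second default delay are assumptions made for the example, not a description of the Cliqz crawler.

```python
import time
from urllib.parse import urlsplit


class PoliteScheduler:
    """Tracks the earliest time each host may be fetched again."""

    def __init__(self, min_delay_seconds=5.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock
        self.next_allowed = {}  # host -> earliest next fetch time

    def wait_time(self, url):
        """Seconds to wait before fetching url (0.0 if it can go now)."""
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - self.clock())

    def record_fetch(self, url):
        """Note that url's host was just fetched and start its cool-down."""
        host = urlsplit(url).netloc
        self.next_allowed[host] = self.clock() + self.min_delay
```

With one request per host every few seconds, a site with a million pages takes weeks to cover no matter how many machines you have, so throughput only comes from crawling many different hosts in parallel.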

I did a large-scale crawl of the web some years ago and we put together a billion pages.

The bulk of the content ended up being index pages - i.e. large list of links taking you to the content - pagination, other breadcrumbs etc.

You can exhaust a lot of resources without getting anything useful.

It is no longer a point and go kind of thing unfortunately, you need a good understanding of page structure, estimate what kinds of links are vital, etc. else there's a ton of crap you'll pick up.

Or maybe I was doing something wrong.

There is no way it costs near a billion dollars to do an effective crawl of the web. That's laughable. I do wonder how much Googlebot spends on bandwidth and servers though.

Wouldn’t common crawl content be enough? If not what are the issues?

No, it's not enough and has poor coverage outside of the USA. We have also answered this question (it appears to be popular) in today's post about technical details of our search https://0x65.dev/blog/2019-12-06/building-a-search-engine-fr...

I remember when Cuil launched about 10 years ago they suggested that 1% of the search market was worth a billion dollars, so it's big money if you can make inroads. Of course search is probably less important now than it was back then, with discovery happening on social media more and more, but the internet as a whole is much larger than 10 years ago, so I wouldn't be surprised if 1% is worth more nowadays.

We need a smarter search.

One that is based on analyzing the content of a page rather than on its page rank.

It goes without saying that it has to be open source.

Apache SOLR would be a good starting point.

Cliqz nearly made me stop using Firefox a while back https://www.heise.de/-3852129 (German article)

TL;DR: 1% of German Firefox installations automatically uploaded search queries to Cliqz. I won't trust a search engine like this with any of my data.

(Disclaimer: I work at Cliqz) Just read the article. It is from 2017 and very short. For the non-German speakers, I have to translate the relevant part:

> Rund ein Prozent der Firefox-Downloads enthalten künftig das Add-On Cliqz, das bereits beim Eintippen Vorschläge für Webseiten anzeigt. Dafür wertet es die Surf-Aktivitäten aller Nutzer aus.

About 1% of the Firefox downloads will contain the Cliqz Addon, which will show you search suggestions for websites while you type. For that, it uses browsing activities of all users.


The last "of all users" is important. Yes, our search is built on data collected from users, but the point is we cannot build profiles of single users; we are only seeing what the whole group of users does. I cannot stress that part enough. We are not Avast.

In fact, we are very open about our data collection system called Human Web:

* https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...

And this article explains how we provide anonymity while sending:

* https://0x65.dev/blog/2019-12-04/human-web-proxy-network-hpn...

I can understand that you did not like the way that Mozilla rolled it out in 2017. I'm also not glad about how it went (my personal opinion). But from the technical side, I'm more than happy to take any question on that topic (how we collect data in Cliqz).

Would you trust google who does the same thing with chrome? Bing which does the same thing with IE (or whatever it's called now)? Blame firefox for selling out their users not the search engine. I didn't close my account with amazon when Ubuntu started sending searches to them, I just switched distros.

> Would you trust google who does the same thing with chrome?

No, which is why I use neither Google nor Chrome.

> Bing which does the same thing with IE (or whatever it's called now)?

No, which is why I use neither Bing (besides indirect use via DuckDuckGo including it as a data source) nor IE/Edge.

> I didn't close my account with amazon when Ubuntu started sending searches to them,

Amazon didn't and doesn't (last I checked) have a financial stake in Canonical, nor did/does Canonical have one in Amazon. No need to blame Amazon here; that was just Canonical being stupid.

However, per Wikipedia, the same disconnection can't be asserted between Mozilla and Cliqz:

> In August 2016, Mozilla, developer of Firefox, made a minority investment in Cliqz. Cliqz plans to eventually monetize the software through a program known as Cliqz Offers, which will deliver sponsored offers to users based on their interests and browsing history.

> [...]

> On 6 October 2017, Mozilla announced a test where approximately 1% of users downloading Firefox in Germany would receive a version with Cliqz software included. The feature provided recommendations directly in the browser's search field, including for news, weather, sports, and other websites, based on the user's browsing history and activities.

That is: Mozilla invests in a company whose stated business model is literally to scrape my browsing history and shove ads into my browser, and then a year or so later starts A/B testing this as something baked into Firefox for German users. That's scummy, no matter which way you look at it, and no matter how many times Cliqz assures users that "we pinky swear we're not collecting any personally-identifiable data".

I'm taking Cliqz' "we care about your privacy" claims with a hydrostatically-equilibrious and possibly-neighborhood-clearing grain of salt.

I mainly use DuckDuckGo and only use Google if DuckDuckGo doesn't bring up anything usable (which is far too often for me tbh). And yes, I blame Firefox for every mishap over the last few years, like the certificate expiration, the Mr. Robot "advertising", the Cloudflare DNS and so on. But I see the good things as well, like throwing out Avast. So I trust them more than Google.

But still, I will never trust something like Cliqz, which belongs to the media GmbH that produces trash like Bunte.

PS: I tested the beta and the search results weren't good. PPS: I will read the search engine articles, though.

There's a difference.

When I use a service from Google, I expect that my data will be parsed by Google. And I can decide if I trust Google or not.

But Firefox sending the urls I visit to a third party (Cliqz) silently and without permission is shady and deceptive.

And then, after all this, Cliqz claims that it's a company built on privacy... sheesh.

An interesting quote from their article: "Philosophically, we believe copying is a loaded term, we prefer to use the term learning. Learning from each other is something all of us do"

What is the difference between copying and learning?

What I miss in this post is a list of references to the huge literature that exists on this topic, and related fields such as NLP.

Also I don't see a clear problem description. What is a search engine, really? How would you compare the quality of two search engines, objectively?

> Why the second constraint? one might ask. Besides the obvious potential for profitability, our mission was

The search engine the world needs is one with independence and non-profitability. If the creators are preoccupied with turning a profit, they’ll introduce the same garbage features as Google. It’s a shame, because a good search engine could shorten the time humanity has to wait for advances (e.g. cures for cancers, cheaper energy, etc.).

Indexing the web is a resource-intensive activity, though. If it were federated, the resource cost would only increase. I suppose a non-profit is the alternative, but non-profits are not exactly independent unless they have some sort of massive endowment. I'm not trying to disagree with you, it's just a paradoxical problem: to resolve the issue, resources must be accumulated. Accumulating resources means it's hard to resolve the issue (of an independent search engine).

Weird, your domain (cliqz.com) was blocked by my pihole.

(Disclaimer: I work at Cliqz) We had problems with being blocked in the past.

In the cases where we got a chance to explain, they agreed that it was a false positive and took us off the block list. At least, that has happened so far in all cases that I'm aware of. However, there are so many lists that it is hard to keep track of them. It would be nice if you could tell us which block list it is, so we can contact them.
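For anyone who wants to check which list is responsible, here is a minimal sketch in Python. It assumes the block lists are available as plain text, either in hosts format (`0.0.0.0 example.com`) or as one domain per line. The function names and the sample lists are hypothetical and not part of Pi-hole or any Cliqz tooling.

```python
def domains_in_blocklist(text):
    """Parse hosts-style or one-domain-per-line blocklist text into a set."""
    domains = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        # Hosts format: first token is an IP, the rest are hostnames.
        # Plain format: the single token is the domain itself.
        hostnames = parts[1:] if len(parts) > 1 else parts
        domains.update(h.lower() for h in hostnames)
    return domains


def lists_blocking(domain, blocklists):
    """Return the names of lists containing the domain or a parent domain.

    `blocklists` maps a list name to its raw text (hypothetical input shape).
    """
    labels = domain.lower().split(".")
    # "beta.cliqz.com" -> {"beta.cliqz.com", "cliqz.com"}; the bare TLD is skipped.
    candidates = {".".join(labels[i:]) for i in range(len(labels) - 1)}
    return sorted(
        name
        for name, text in blocklists.items()
        if candidates & domains_in_blocklist(text)
    )
```

Calling `lists_blocking("cliqz.com", {"some_list": open("some_list.txt").read()})` with your own downloaded lists would report the names of the lists that contain the domain. Matching on parent domains is a design choice of this sketch; a particular blocker may match differently.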

The reason we end up on a blocklist is normally a misconception about our data collection system, Human Web: https://0x65.dev/blog/2019-12-03/human-web-collecting-data-i...

If someone does not want to send Human Web data, the feature can also be disabled through the UI. Same if you browse in a private window; Human Web is automatically disabled there. There is no need to configure blocking rules.

Yes, we are on some minor blocking lists because of our data collection, even though it is anonymous (please check the articles about Human Web on https://0x65.dev/). Sending data, no matter what data, is a sin that has to be punished. A disservice, if you ask me, but what can we do? [Disclaimer: I work at Cliqz]

There is a dark side to this story. Burda ( https://en.wikipedia.org/wiki/Hubert_Burda_Media ), the same people who are behind the Cliqz search engine, were originally also behind the German Leistungsschutzrecht (ancillary copyright for press publishers): https://en.wikipedia.org/wiki/Ancillary_copyright_for_press_... This law, heavily lobbied for by publishers, forces every search engine and everybody else using content from the internet to pay a private tax of 6% of revenue (not of profit!): https://www.vg-media.de/de/digitale-verlegerische-angebote/f... As the profit margin of most internet companies is below this rate, it is essentially forcing many companies out of business.

This tax is enforced and collected by VG Media, the German collecting society representing rights of a group of German publishers. https://www.vg-media.de Between 2013 and 2016 Burda was a shareholder of VG Media, which was commissioned to enforce the tax in its name.

The evil thing about this law is that the publishers are not required to mark their content in machine-readable form as paid content. And manual selection is infeasible at internet scale, with billions of pages. So a search engine has no means to bypass the paid content and index only free content, such as Wikipedia, which makes up the majority of internet content. Essentially, the "Leistungsschutzrecht" takes the free content hostage to extort money for using the internet, even if you don't use the publishers' paid content (just the 200 publications that VG Media represents).

So while Burda's Cliqz writes on its blog "The world needs more search engines" https://www.0x65.dev/blog/2019-12-01/the-world-needs-cliqz-t... they supported a law that made it impossible for many search engines to operate in Germany (and in the EU via the similar EU law "Extra copyright for news sites" (“Link tax”) https://juliareda.eu/eu-copyright-reform/extra-copyright-for... ). And while today they are no longer a shareholder of VG Media, they still benefit from the suppressive legal environment they helped to create, as it prevents any new independent competition from entering the search market.

[Disclaimer: I work at Cliqz]

Sorry for taking so long to reply, I was personally trying to dig up some information about this. An additional disclaimer: not a lawyer either.

Honestly, I have little idea of how this law affects search engines. What I can say is that we are not paying anything, and AFAIK we do not know anyone who is. Moreover, if some publisher were to complain, even one within Burda, we would stop crawling by domain. There is no technical issue here, since properties are known from the imprint. We have no say in what the investors do, but I can assure you that we are under no pressure. For instance, our ad-blocker works everywhere, regardless of whether the sites are from Burda or not.

On a general level, assuming that what you say is factually correct, I must personally agree that regulation is a bitch. It's typically designed for big companies to control other big companies, but small ones get negatively affected, if only because of the lack of resources. We recently had to suffer all the overhead of GDPR, which consumed a fair amount of our time; relatively, we paid a higher price than Google.

Personally, I cannot answer for all the decisions made by the people funding Cliqz, and I do not think I can judge them either. They might be complaining and lobbying, no idea. But they are also putting good money into building a privacy-preserving search engine and a browser, something that no one else is doing, so by my account they are on the positive side.

How many algorithms are there in chrome alone? I remember when people realized that they could game Facebook shares for higher rankings on chrome and for a while buzzfeed top ten lists outranked Wikipedia every fucking time. I guess that’s still going on. What a clusterfuck search results are nowadays.

If anyone builds anything, please make it so algorithms or queries are archived. I hate how I can’t find anything on the internet that I searched for and found years ago. It’s like the history of the internet evaporates every year. I don’t even know if some websites still exist or if I simply can’t find them because rankings are terrible.

I’m at the point where I haven’t been on a new website in years. How do you find new websites in this day and age when the same websites are ranked at the top every time?
