>Queries performed by people, if associated to a web page, serve as even cleaner summaries than anchor text. This is because all the logic put in place by the search engine, who resolved the query with a list of web pages, and all human understanding and experience that led one to select the best page from the offered result list end up embedded in the association <query, url>.
This would seem to present a "rich get richer" problem where the oldest links that have the largest click-through tend to float to the top making it difficult for a new result that may be "better" to appear high in the search results. Anyone know how search engines tackle this problem?
If it's shown in position one, it should be expected to get more clicks than position ten. If something gets more than expected, move it up.
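A rough sketch of that idea in Python. The expected click-through rates per position are made-up numbers, and `ResultStats`, `clicks_over_expected` and `rerank` are hypothetical names used only for illustration; this is not a claim about how any particular search engine does it.

```python
from dataclasses import dataclass

# Hypothetical expected click-through rate per result position (1-based).
# These numbers are invented for illustration only.
EXPECTED_CTR = {1: 0.30, 2: 0.15, 3: 0.10, 4: 0.07, 5: 0.05,
                6: 0.04, 7: 0.03, 8: 0.03, 9: 0.02, 10: 0.02}


@dataclass
class ResultStats:
    url: str
    position: int     # position at which the result was shown
    impressions: int  # how often it was shown at that position
    clicks: int       # how often it was clicked there


def clicks_over_expected(stats: ResultStats) -> float:
    """Ratio of observed clicks to the clicks expected at that position."""
    expected = EXPECTED_CTR[stats.position] * stats.impressions
    return stats.clicks / expected if expected > 0 else 0.0


def rerank(results):
    """Order results by how much they over-perform their position."""
    return sorted(results, key=clicks_over_expected, reverse=True)
```

With these numbers, a page shown 1000 times at position ten that collects 40 clicks (twice the 20 expected there) sorts ahead of a page at position one that collects 250 clicks (fewer than the 300 expected).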
Your point is spot on. Old pages tend to have more associations with seen queries, which does not work in favor of new pages.
That said, there are a couple of things to consider: 1) seen queries are not the only way to create queries; we are pretty good at creating synthetic queries based on the content, descriptions, etc. These queries are noisier than the seen queries, of course, but good enough. And 2) novelty, freshness and popularity are very important features in the ranking. Feel free to try out any new topic you might think of on https://beta.cliqz.com, and you will see that it is not only "stale" content.
It is still a little fuzzy to me. What is a "synthetic query"? Is this basically generating queries that would match the content (i.e. essentially reversing the process)?
Novelty and freshness are interesting but can lead back to the noise problem mentioned in the blog. If many pages are created that may match the query (e.g. "best new movies"), many young pages will match it. Popularity would be useful but is difficult to establish, and then there are clickbait and other gaming problems.
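For what it's worth, one naive reading of "synthetic queries based on the content, descriptions, etc." is to reverse the process and emit word n-grams from a page's title and description as candidate queries. The sketch below assumes exactly that; the function names and the tiny stopword list are made up, and a real system is surely more sophisticated.

```python
import re

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "for", "to", "and", "is"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def synthetic_queries(title, description, max_len=3):
    """Generate candidate queries as word n-grams (1..max_len) from page fields."""
    queries = set()
    for field in (title, description):
        tokens = tokenize(field)
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                queries.add(" ".join(tokens[i:i + n]))
    return queries
```

Each generated query would then be associated with the page's URL, the same way a seen query is, just with a lower weight to account for the noise.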
"Never attribute to malice that which can be adequately explained by stupidity."
Two years and three domains with the typo? It's malice.
That, or any kind of exploration/optimization algorithm, to be honest.
You are on point. Recency is a challenging problem for search engines in multiple ways. It is not just about discovering new content, but also about how to index it, and how to balance "very new", "new", "slightly old" and "really old" results for the same query during ranking. This involves both news and new webpages surfacing on the web.
On top of this, we have to remember that this is a fully autonomous real time system which requires solving some of the most difficult engineering challenges at scale and at the same time being mindful of the latency and quality constraints.
At the end of the day, it's all about the final user experience that we ship, and we are very mindful of that. We will be publishing more details about Cliqz search on our blog https://0x65.dev/ in the coming days, so stay tuned.
Just as an FYI, this company Cliqz is owned by Hubert Burda Media, a large media conglomerate based in Germany.
That doesn't inherently mean anything negative, but it's important to understand the potential underlying incentives, given that they market themselves as such a strongly privacy-oriented service.
They mostly push a narrative of privacy and censorship, when in the end the answer is probably close to "we want a piece of the pie" or "we want to be that monopoly".
Choice is good.
1) Trackers Stats. Essentially, you can see how many and what trackers there are on the page you are about to visit. Before visiting it.
2) Page previews (I'm not sure about whether I like that)
This feature is powered by another project we run, where we measure the tracking landscape on the web (most popular domains): https://whotracks.me. Details on how that works can be found in our paper. Also, we are flirting with the idea of providing a mode where the ranking is informed by the trackers in the destination site. Would love to hear your thoughts on whether you'd like something like this.
> 2) Page previews (I'm not sure about whether I like that)
At the moment it's only a placeholder for a lengthier title and description (if available), but we are planning to use the space for rendering a short summary of the content/media in that site + similar sites in terms of content (query-relevant of course). This is more work in progress as we want to make sure content creators are on board. Again: would love to hear your thoughts on that.
Disclaimer: I work at Cliqz.
WhoTracks.Me: Shedding light on the opaque world of online tracking - https://arxiv.org/abs/1804.08959
That's definitely the right way to go. I would also very much appreciate an option not to show in the result list sites/pages having any trackers.
I noticed that even Wikipedia is reported as having some trackers. But when I looked closer I noticed that most of those belong to the Wikimedia foundation, which is fine. I mean, I don't mind site owners tracking what I do on their site, I just don't want to be followed across the whole Web.
The rest of the Wikipedia trackers are supposed to be Google fonts and statics, but I couldn't witness any calls to those. Maybe the stats are not quite up to date?
If such a score is to be given, it had better be fair and reflect the current state of affairs.
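To make the idea concrete, here is a minimal sketch of a tracker-informed ranking with an opt-in filter. Everything in it is an assumption: the `Result` type, the relevance scores, the penalty value, and the choice to count only third-party trackers (following the first-party vs. cross-site distinction above).

```python
from dataclasses import dataclass, field


@dataclass
class Result:
    url: str
    relevance: float  # hypothetical relevance score, higher is better
    third_party_trackers: set = field(default_factory=set)


def tracker_aware_rank(results, penalty=0.05, exclude_tracked=False):
    """Rank by relevance minus a per-tracker penalty; optionally drop tracked pages."""
    if exclude_tracked:
        results = [r for r in results if not r.third_party_trackers]
    return sorted(
        results,
        key=lambda r: r.relevance - penalty * len(r.third_party_trackers),
        reverse=True,
    )
```

Setting `penalty=0.0` gives back the plain relevance ordering, so the tracker signal stays a mode the user can switch on or off.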
These stats are updated monthly, and based on millions of loads of each site. The WhoTracks.Me page for wikipedia.org (https://whotracks.me/websites/wikipedia.org.html) shows that the Google Fonts and Google Static trackers occur very infrequently (<2% of pages), so may be on some part of the site that you did not visit.
While the Wikimedia tracker may seem innocuous, they do set a cookie that is sent in third-party contexts, and have presence across several sites beyond Wikipedia (133 of the top 10k) (https://whotracks.me/trackers/wikimedia.org.html). Theoretically, they could track user sessions across these sites. In reality this is likely an oversight in the server configuration, but objectively this profile looks no different to that of a legitimate tracker.
I want to be able to open a Jupyter-like notebook with the start of my search query, and the first step should be to show me the available eigencontexts, from which I can establish the gross context for my entire search. After this first click, none of the results should be about the board game or the English word--unless the relevant search results happen to include an implementation of Go the board game in Go the language.
And then when I'm done, I want to name and archive that notebook so I can return to it at a later date--whether to refresh my memory of the ultimate answer, or to continue the search.
I guess I would call this a 'research engine' instead of a search engine.
I hate to answer this one, because it looks too much like marketing-speak, but this feature exists. Not on beta.cliqz.com, but in the drop-down search of the Cliqz browser.
Based on the tabs you have open, different query expansions are selected. For instance, as you type "hotel in ma..." it would probably show you results for Mallorca, but if you have "Madrid" on a tab, then it will show results for "hotel in madrid".
There will be a blog post about this contextual search because it's our showcase that it is possible to do personalization without compromising privacy. All this is done privately: the browser receives results for multiple expansions and can choose which one to display based on local information. We never track or collect sessions of users.
Interestingly, the working name for my idea was also "ResearchEngine", so I guess it summarizes pretty well the unmet need you and I have.
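A toy sketch of that local selection step, assuming the server has already returned results for several expansions. The function name, the word-overlap scoring and the data shapes are all hypothetical; the only point is that tab titles are consulted on the client and never sent anywhere.

```python
def choose_expansion(results_by_expansion, open_tab_titles, default):
    """Return the expansion whose words best match locally known tab titles.

    results_by_expansion maps each candidate expansion (e.g. "hotel in madrid")
    to the results the server returned for it. The choice is made locally.
    """
    tab_words = {w for title in open_tab_titles for w in title.lower().split()}
    best, best_score = default, 0
    for expansion in results_by_expansion:
        # Count distinct words of the expansion that also occur in a tab title.
        score = sum(1 for w in set(expansion.lower().split()) if w in tab_words)
        if score > best_score:
            best, best_score = expansion, score
    return best


results = {
    "hotel in mallorca": ["https://example.com/mallorca-hotels"],
    "hotel in madrid": ["https://example.com/madrid-hotels"],
}
chosen = choose_expansion(results, ["Madrid travel guide"], default="hotel in mallorca")
```

With a "Madrid" tab open the Madrid expansion wins; with no matching tabs the default expansion is shown.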
AltaVista used to do this. I miss it terribly.
I also want to maintain a longer-term list of pages/sites to exclude. Like, unless otherwise specified, I never want results from w3schools or expertsexchange.
This field is ripe for disruption, in my opinion. We can do so much better, but I've not seen any serious attempts.
> Europe has failed to build its own digital infrastructure. US companies such as Google have thus been able to secure supremacy. They ruthlessly exploit our data. They skim off all profits. They impose their rules on us. In order not to become completely dependent and end up as a digital colony, we Europeans must now build our own independent infrastructure as the foundation for a sovereign future. A future in which our values apply. A future in which Europe receives a fair share of the added value. A future in which we all have sovereign control over our data and our digital lives. And this is exactly why we at Cliqz are developing digital key technologies made in Germany.
I have an uncomfortable feeling, however, that this is when the walls really start going up on the internet, beyond just the dictatorships.
Especially seeing as how Europe is not a nation.
> A future in which we all have sovereign control over our data and our digital lives. ... And this is exactly why we at Cliqz are developing digital key technologies made in Germany.
Ahhh, that makes more sense, "sovereign" == "made in Germany."
"Hiybbprqag?" How Google Tripped Up Microsoft
But what I would really like to see, as has been mentioned in other threads, is an open-source or community-funded search engine. Something that "belongs to the web" itself, so to speak, and not to any particular corporation.
Tried many times to use it but since it was a resource hungry java application that required me to use the web through a http proxy to contribute it wasn't really useable for me at any time. Also the search results were mostly garbage for me.
We will have a blog post tomorrow on this very topic, but in short, we use a combination of Keyvi, Granne (both in-house) along with Cassandra and RocksDB.
Though our approach mentioned in this blogpost significantly reduces the storage needed to host the index, we still have an index of around 50 TB of data.
FYI, there will be a bunch of articles regarding search in the next week.
'chat i w ' is number 13, and has been top 10 for much of past couple / few years.. yet they are 'not an adult site' since they run GGL ads...
should be top 5 again.. sexchatsexchat.com has way more content and history..
there are many more sites I could suggest that actually have chat systems running (unlike the porn dood site which is a top 20 link list)
there are many good sites that aren't even in the results at all... these are being gamed by well connected linkers, not ranked by amount of content and length of time people would stay and enjoy.
imbo - in my biased opinion, I have more to add but wonder if it does any good.
a new engine that handles adult better, I would help with.. the other sites listed here do not do these results justice either.. again imbo, ymmv.
I noticed that your engine ranks some of the nastier sites on the internet far higher than any other search engine I've looked into.
We do have a list of blacklisted URLs/domains, mostly regarding adult topics (child porn etc.). If you have noticed some bad sites in our results, please feel free to drop a line to our support team using the link I provided.
I'd really like it if there were an ethical SERP that at least had some integrity with its results. Reporting factual unflattering statements is one thing (and ideal), but promoting libel feels really dirty, and so far Cliqz seems to be the worst at that of any search engines I've used, and your reporting link seems as though Cliqz is okay with that.
Had the same issue with another article from this same site a couple of days ago. Looks like everyone else is able to read it but for some reason not me.
Anyone know what's going on?
Interesting. Could you tell us what error you see?
Other ways you can reach the blog:
If you use Tor browser can you try opening:
Or if you use Beaker browser:
Disclaimer: I work for Cliqz.
This site can’t be reached
0x65.dev refused to connect.
Checking the connection
Checking the proxy and the firewall
This happens on Chrome, Firefox, Safari, and Opera on my Mac.
Seems like you have some mapping for the .dev TLD.
I assume, based on your mention of Safari, that you are using a Mac.
Could you check if you have some setting in your /etc/resolver for the dev TLD, or if you are using some service like dnsmasq which is trying to resolve .dev to a non-existent location?
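If it helps, a small Python sketch of that check. It assumes the macOS convention that a file in /etc/resolver named after a domain (here `dev`) overrides DNS resolution for that domain; `resolver_override` is a made-up helper name.

```python
from pathlib import Path


def resolver_override(tld, resolver_dir="/etc/resolver"):
    """Return the contents of the resolver file for `tld`, or None if absent."""
    candidate = Path(resolver_dir) / tld
    return candidate.read_text() if candidate.is_file() else None


if __name__ == "__main__":
    override = resolver_override("dev")
    if override is None:
        print("No /etc/resolver entry for .dev")
    else:
        print("Found /etc/resolver/dev with contents:")
        print(override)
```

If a file shows up, its `nameserver` lines say where .dev lookups are being sent, which is usually a local development tool.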
I am on a Mac but I didn't touch anything. Is this how a Mac ships by default? Or do you think some app may have created this file?
Why are costs so high for crawling?
We have to be very careful, since when we get blocked there is normally no way to get unblocked again. You can try to send them an email to unblock you, but it is unlikely that you will get a response. This is one part of the explanation why crawling is slow. The other part is more obvious: the internet is large.
The blocking part is hard to overcome as a small player, while for Google it is the opposite, as sites simply cannot afford to be excluded from the index. If we did not have to care about rate limits, the problem would be simpler.
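To illustrate why rate limits slow things down, here is a minimal per-host politeness scheduler. It is only a sketch under assumptions: the class name, the fixed five-second delay and the in-memory bookkeeping are invented here and say nothing about how the Cliqz crawler actually works.

```python
import time
from typing import Optional
from urllib.parse import urlsplit


class PoliteScheduler:
    """Tracks, per host, the earliest time the next request may be sent."""

    def __init__(self, min_delay_seconds: float = 5.0):
        self.min_delay = min_delay_seconds
        self.next_allowed = {}  # host -> earliest allowed fetch time

    def wait_time(self, url: str, now: Optional[float] = None) -> float:
        """Seconds to wait before fetching url; 0.0 if it may be fetched now."""
        now = time.monotonic() if now is None else now
        host = urlsplit(url).netloc
        return max(0.0, self.next_allowed.get(host, 0.0) - now)

    def record_fetch(self, url: str, now: Optional[float] = None) -> None:
        """Note that url's host was just fetched and must now rest."""
        now = time.monotonic() if now is None else now
        host = urlsplit(url).netloc
        self.next_allowed[host] = now + self.min_delay
```

At one request per host every five seconds, a single site with a million pages takes roughly two months to crawl, no matter how much hardware is available.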
The bulk of the content ended up being index pages - i.e. large list of links taking you to the content - pagination, other breadcrumbs etc.
You can exhaust a lot of resources without getting anything useful.
Unfortunately, it is no longer a point-and-go kind of thing. You need a good understanding of page structure, you need to estimate which kinds of links are vital, etc.; otherwise there's a ton of crap you'll pick up.
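One cheap heuristic for spotting those index pages is link density: the share of visible text that sits inside links. The sketch below uses only the standard library; the 0.6 threshold is an arbitrary assumption, and it naively counts script and style text as visible, so treat it as a starting point rather than a working classifier.

```python
from html.parser import HTMLParser


class LinkDensityParser(HTMLParser):
    """Counts text characters inside and outside <a> elements."""

    def __init__(self):
        super().__init__()
        self.anchor_depth = 0
        self.link_chars = 0
        self.total_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchor_depth += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.anchor_depth > 0:
            self.anchor_depth -= 1

    def handle_data(self, data):
        n = len(data.strip())
        self.total_chars += n
        if self.anchor_depth > 0:
            self.link_chars += n


def link_density(html: str) -> float:
    """Fraction of text characters that are inside links (0.0 for empty pages)."""
    parser = LinkDensityParser()
    parser.feed(html)
    parser.close()
    return parser.link_chars / parser.total_chars if parser.total_chars else 0.0


def looks_like_index_page(html: str, threshold: float = 0.6) -> bool:
    """Guess that a page is mostly navigation when its link density is high."""
    return link_density(html) >= threshold
```

A crawler could still follow the links on such pages while skipping them for indexing, which saves storage without losing the content they point to.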
Or maybe I was doing something wrong.
One that is based on analyzing the content of a page rather than on its page rank.
It goes without saying that it has to be open source.
Apache Solr would be a good starting point.
TL;DR: 1% of German Firefox installations automatically uploaded search queries to Cliqz. I won't trust a search engine like this with any of my data.
> About one percent of Firefox downloads will in future include the Cliqz add-on, which shows suggestions for websites as you type. For that, it evaluates the browsing activity of all users.
The last "of all users" is important. Yes, our search is built on data collected from users, but the point is we cannot build profiles of single users; we are only seeing what the whole group of users does. I cannot stress that part enough. We are not Avast.
In fact, we are very open about our data collection system called Human Web:
And this article explains how we provide anonymity while sending:
I can understand that you did not like the way that Mozilla rolled it out in 2017. I'm also not glad about how it went (my personal opinion). But from the technical side, I'm more than happy to take any question on that topic (how we collect data in Cliqz).
No, which is why I use neither Google nor Chrome.
> Bing which does the same thing with IE (or whatever it's called now)?
No, which is why I use neither Bing (besides indirect use via DuckDuckGo including it as a data source) nor IE/Edge.
> I didn't close my account with amazon when Ubuntu started sending searches to them,
Amazon didn't and doesn't (last I checked) have a financial stake in Canonical, nor did/does Canonical have one in Amazon. No need to blame Amazon here; that was just Canonical being stupid.
However, per Wikipedia, the same disconnection can't be asserted between Mozilla and Cliqz:
> In August 2016, Mozilla, developer of Firefox, made a minority investment in Cliqz. Cliqz plans to eventually monetize the software through a program known as Cliqz Offers, which will deliver sponsored offers to users based on their interests and browsing history.
> On 6 October 2017, Mozilla announced a test where approximately 1% of users downloading Firefox in Germany would receive a version with Cliqz software included. The feature provided recommendations directly in the browser's search field, including for news, weather, sports, and other websites, based on the user's browsing history and activities.
That is: Mozilla invests in a company whose stated business model is literally to scrape my browsing history and shove ads into my browser, and then a year or so later starts A/B testing this as something baked into Firefox for German users. That's scummy, no matter which way you look at it, and no matter how many times Cliqz assures users that "we pinky swear we're not collecting any personally-identifiable data".
I'm taking Cliqz' "we care about your privacy" claims with a hydrostatically-equilibrious and possibly-neighborhood-clearing grain of salt.
But I will still never trust something like Cliqz, which belongs to the media GmbH that produces trash like Die Bunte.
PS: I tested the beta and the search results weren't good.
PPS: I will read the search engine articles, though.
When I use a service from Google, I expect that my data will be parsed by Google. And I can decide if I trust Google or not.
But Firefox sending the urls I visit to a third party (Cliqz) silently and without permission is shady and deceptive.
And then, after all this, Cliqz claims that it's a company built on privacy... sheesh.
What is the difference between copying and learning?
Also I don't see a clear problem description. What is a search engine, really? How would you compare the quality of two search engines, objectively?
> Why the second constraint? one might
> ask. Besides the obvious potential for
> profitability, our mission was
In cases where we got a chance to explain, they agreed that it was a false positive and took us off the block list. At least, that has happened so far in all cases that I'm aware of. However, there are so many lists that it is hard to keep track of them. It would be nice if you could tell us which block list it is, so we can contact them.
The reason why we end on the blocklist is normally a misconception of our data collection system Human Web:
If someone does not want to send Human Web data, the feature can also be disabled through the UI. Same if you browse in a private window; Human Web is automatically disabled there. There is no need to configure blocking rules.
This tax is enforced and collected by VG Media, the German collecting society representing rights of a group of German publishers. https://www.vg-media.de
Between 2013 and 2016 Burda was a shareholder of VG Media, which was commissioned to enforce the tax in its name.
The evil thing about this law is that the publishers are not required to mark their content as paid content in machine-readable form. And a manual selection is infeasible at internet scale, with billions of pages.
So a search engine has no means to bypass the paid content and index only free content, such as Wikipedia, which makes up the majority of the internet content.
Essentially the "Leistungsschutzrecht" (ancillary copyright for press publishers) takes the free content hostage to extort money for using the internet, even if you don't use the paid content of the publishers (just the 200 publications that VG Media represents).
So while Burda's Cliqz write on their blog "The world needs more search engines" https://www.0x65.dev/blog/2019-12-01/the-world-needs-cliqz-t...
they supported a law that made it impossible for many search engines to operate in Germany (and in the EU via the similar EU law "Extra copyright for news sites" (“Link tax”) https://juliareda.eu/eu-copyright-reform/extra-copyright-for...
And while today they are no longer a shareholder of VG Media, they still benefit from the suppressive legal environment they helped to create, as it prevents any new independent competition from entering the search market.
Sorry for taking so long to reply, I was personally trying to dig some information about this. An additional disclaimer: not a lawyer either.
Honestly, I have little idea of how this law affects search engines. What I can say is that we are not paying anything, and AFAIK we do not know anyone who is. Moreover, if some publisher complained, even one in Burda, we would stop crawling by domain; there is no technical issue here, as properties are known from the imprint. We have no say in what the investors do, but I can assure you that we are under no pressure. For instance, our ad-blocker works everywhere, regardless of whether the sites are from Burda or not.
On a general level, assuming that what you say is factually correct, I must personally agree that regulation is a bitch. It's typically designed for big companies to control other big companies, but small ones get negatively affected, if only because of the lack of resources. We recently had to suffer all the overhead of GDPR, which consumed a fair amount of our time; relatively speaking, we paid a higher price than Google.
Personally, I cannot answer for all the decisions made by the people funding Cliqz, and I do not think I can judge them either. They might be complaining and lobbying, no idea. But they are also putting good money into building a privacy-preserving search engine and a browser, something that no one else is doing, so in my book they are on the positive side.
If anyone builds anything, please make it so algorithms or queries are archived. I hate how I can’t find anything on the internet that I searched for and found years ago. It's like the history of the internet evaporates every year. I don’t even know if some websites still exist or if I simply can’t find them because rankings are terrible.
I’m to the point that I haven’t been on a new website in years. How do you find new websites in this day and age when the same websites are ranked at the top every time?