Love your mission, really hope you can build something competitive enough to make the internet more distributed.
I searched for one of the keywords that rank one of my blog posts at the top in Google, and it shows someone who copied my article word for word - years later - instead of my site.
Not sure how the algorithm works, but you may want to look into original sources and copycats. In this particular case it's not a critical issue (just me, some random guy, losing a visit and ad revenue), but at a larger scale it could be more serious.
Thanks for the feedback. Could you share the search you did and the offending copycat? You can do that here or via the Feedback link at the bottom right of the SERP.
Works well; I found what I was looking for after using it for a few minutes. What surprised me was how many extremely obscure sites came up before ones that get hit tens of thousands of times a day. It's quite difficult to figure out "popularity" without collecting data, so I would bet this will be a large issue. An opt-in, anonymous "you are allowed to collect data on which result I click for a given query" option would be nice. I would certainly use it.
Overall, I think the name isn't too catchy, but it's nice and unique. You definitely need a logo or something more identifiable. For now, the light green / lime tone is not that nice to look at, I think a more "boring" grey-tone or something blue would be nicer.
Nice work! Looking forward to this becoming a big deal!
> What surprised me was how many extremely obscure sites came up before ones that get hit tens of thousands of times a day.
What a BLESSING! All I ever find on Google/Startpage/Bing/DDG is last week's product listings, sponsored reviews and influencer "content". Let's embrace randomness!
EDIT: i just managed to find some serialz (for the lulz). Almost like the old days. What a blast!
We are not aware that we have any problem like that.
This might explain why Gigablast has a problem:
Because of bugs in the original Gigablast spidering code, the Findx crawler ended up on a blacklist in Project Honeypot as being “badly behaved” (fixed in our fork). That meant quite a bit of trouble for us because CDN providers, which are very powerful hubs for internet traffic, put a lot of weight on this blacklist. Some of the most popular websites and services on the internet run through services like Cloudflare and other CDNs – so if you are in bad standing with them, suddenly a large part of the internet is not available, and we weren't able to index it.
Does this mean your spider is a fork of Gigablast? Is there some additional interesting technical information about how your code/infrastructure is set up?
I realise this is not addressing your second question, but you might find it interesting: the post below, from one year ago, covers our server expansion. We are adding another 100 servers over Christmas and early in the new year.
Mojeek follows the robots.txt protocol so if a site doesn't want to be crawled by MojeekBot we respect that wish. There is also a generous crawl delay between pages on the same host.
Generally a 'badly behaved bot' will ignore robots.txt or hit a site too hard with requests.
> There is also a generous crawl delay between pages on the same host.
What's the order of magnitude of this delay? milliseconds? hundreds of milliseconds? seconds? I'm curious what's considered 'polite' in this realm and how the various parties come to form opinions on this.
I just had a look and there's a non-standard "crawl-delay" directive extension to robots.txt that can be used to ask a spider to take some time between page visits.
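For example, a site asking all crawlers to wait 10 seconds between page requests would serve a robots.txt like this (illustrative values):

    User-agent: *
    Crawl-delay: 10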
Hello, MojeekBot doesn't observe the crawl-delay directive, but thanks for the reminder; it's beneficial for us to know if site owners require more grace between requests.
No need to go into details, but how are you for usage and profitability?
I remember Gabe Weinberg mentioning on here that DDG became profitable pretty early on, so I'd be interested in how easy it is to meet that threshold and how sustainable it is to create search competitors!
Great question, and the one that matters. We are not profitable, and although Mojeek was incorporated in 2009, it is only as of July 2020 (when I started) that we began a serious go-to-market push.
Mojeek has been built from the ground up, meaning we do not depend on anyone's technology. We have our own servers, crawler, index, ranking and so on.
So unlike most, we are not using Bing's, Google's, Yandex's or anyone else's technology. Our road to sustainability is a different one; we are a technology company.
Here are two posts on how we are funded and our business model; the latter also covers privacy and (lack of) surveillance practices.
I think it's worth mentioning that DDG does not do its own crawling; they use the Bing API and therefore don't need that much infrastructure (except the presentation layer).
This means they just need to sell enough advertising (which I think Bing provides / forces them to use). So basically, for DDG, more visitors = more revenue, and break-even means the traffic pays for the costs they have.
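As a rough sketch of that break-even logic (every number below is invented for illustration, not DDG's actual figures):

    # Hypothetical break-even arithmetic for an ad-funded search engine
    daily_searches = 50_000_000          # queries served per day (invented)
    revenue_per_1k = 0.50                # USD earned per 1,000 queries (invented)
    daily_revenue = daily_searches / 1000 * revenue_per_1k
    daily_costs = 20_000                 # USD/day for API fees, servers, staff (invented)
    print(daily_revenue >= daily_costs)  # True here: traffic pays the bills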
Understandable, and also a reason why I'd be interested to know if the infra is already being covered or is being stood up with runway.
If I were to create a privacy first search engine, I would likely not go general purpose and would instead try to focus on specific use cases first for specific audiences (that would be willing to pay). In practice this would look more like an extended version of DDG's instant answers than a Google-alike, but I think the world is convinced that general purpose web search is a) necessary and b) the only valid approach to finding answers to questions.
b) is an oversimplification, but I'd consider it to be a thought experiment whenever a story about Google or Duckduckgo comes up.
You will always find:
><Search Engine> is <bad/good> and the results are way <worse/better> than <Another search engine>.
><Search Engine> is only a frontend for <Parent search engine> <but/and> does(n't) do their own indexing.
>Results for <Topic (usually technical)> (in <country>) on <search engine> are garbage. Search results for <purpose> are poor.
What you will see as a response:
>You should use <search engine> instead! It uses <different parent search> instead so the results are better!
>I agree, I didn't like <search engine> so I had to go back to Google.
What you won't see:
>Nowadays I only use <search engine> for some searches; most of the time I use <specific source(s)>.
>Maybe you should use specific searches for specific purposes?
The two that you cited are very valid, and I'm sure many people use specific sources for specific things (most doing it unknowingly via apps). The main thing seems to be that (especially on desktop) the single-purpose, all-encompassing web portal is the way people find things. The phrase "Google it" is a catch-all for 'do your research'.
Hey mate, first, kudos for all the work you've put into developing Mojeek up to this level. That seems like an enormous achievement for such a small team as yours.
Then, I'm wondering: do you guys have the notion of "product brands" or something like that?
For example, when I search for "SaaSHub" - a product (I work on) that has been online since 2014 and has quite a few mentions around the web - the first result is some "random" Wordpress theme. That is, if I'm searching for a particular product, I'd expect the product homepage to be the first result.
The feedback is appreciated, and your example illustrates a general challenge that we are actively working on. I note that your site shows as link #2, so someone looking for your site/brand should see it; but still, your point stands.
Wow, this is a terrific answer to the exact question of "what's a good unbiased web crawler" I asked yesterday in another thread [0].
Do you disclose what goes into your ranking algorithm? I think having that transparency, and perhaps being able to tweak some ranking parameters, would go a long way toward being able to verify it's built in the best interests of the users.
That's not a challenge right now, but yes, it could become more of one.
A balance in ranking between those seeking information and those looking to promote content will always be an issue, at any level of transparency. Done right, more transparency can benefit seekers too.
That's a big question and the details would require a long answer. A short one is that it's mostly algorithmic aided by human feedback to improve; manual intervention is rare.
I have been using Mojeek for more than a year now, together with other lesser-known search engines. I mainly use it through eTools metasearch: eTools lets you decide which search engines to query. You can define 15 search engines (including Google) and mark each as important or not.
Sometimes, less than 10 times _a month_, I need to launch Google to check an address on maps or some images.
Mojeek is very useful for me and covers most of my needs, which is a bit of a surprise considering the variety of subjects I look for.
I also run a little Yacy node.
Anyone who wants to understand how few search engines are real crawlers should open "The Search Engine Map" https://www.searchenginemap.com created by Mojeek.
I tried this a few days ago. The results reminded me of search engines a decade ago - when keyword hacking for SEO placement had caught on, but search engines hadn't learnt to work around it. (I can't access that history now, but one of the searches I tried was 'rec file viewer', and the results were dominated by the many crappy "file extension info" sites riddled with ads and "Is This A Virus" FUD.)
Btw, I'm not a native English speaker and Mojeek seemed like a great name to me immediately. It's definitely easier to remember than MetaGer and YaCy, and on par with Qwant (I've been looking into alternative search engines recently). There were (and still are) people who thought DuckDuckGo was a terrible name; even if that's true, they've shown it's not a significant obstacle at all. Good luck with your results; those are the only thing that really matters.
I applaud an independent, new spider and search engine.
Sadly, the name is pretty horrible.
If you have to spell a domain out letter by letter when telling people in person, it’s not a good name. You also don’t appear to own common misspellings, like mogeek.com or moejeek.com or mojeak.com or at least they don’t redirect to the main website.
Still, I hope it gains market share and usage and I hope in a few years you may have the funding necessary to rebrand it into a more user friendly name.
I dislike the name too, but people had trouble spelling Google as well in the early days. I'm from The Netherlands, where it's not immediately obvious how to spell Google. Yet Google had a very high market share here almost instantly, while it took them a lot longer to beat Yahoo in the US.
It is pronounced ‘moh-jeek’ with a hard ‘j’. In many countries this is an unnatural sound, and so we encourage the use of ‘moh-yeek’ to those folks who find our name near unpronounceable.
As opposed to Google, Yahoo, Bing? Mojeek seems equally frustrating when it comes to naming... but if they become popular, everybody will assimilate the name.
Give me a way to pay for this; I want to be a customer instead of the product. What is the lowest per-user price you can offer if 1% of users paid reliably? I assume it scales inversely with the % who actually pay.
"Search engine companies typically earn revenue from advertising (in many different forms). Other sources of revenue can include APIs and partnerships, including site and enterprise search. Recently search companies have started to explore a subscription model and/or micropayments. At Mojeek we are exploring options for our business model and all of these are potential routes for us; indeed we have had small revenues from some of them without being focussed on revenue streams. In 2021 we will start to focus on which revenue options to pursue proactively. Enabling people to search on Mojeek without tracking (aka surveillance) is a something we will not compromise on, so any routes we go down will be guided by our privacy-by-design principles."
Consider the use of capability tokens instead of username/password pairs to grant access to features. This would allow someone to sponsor someone else without sharing any information.
Mojeek generates a signed token that IS authorization, you match it against a list of valid tokens, with no other information required.
This could be used to federate subscriptions, and hosting of front ends such that they could stand alone without having any knowledge of the users, while only serving valid users.
Flickr has something like this that grants access to an otherwise private portfolio, yet the user can revoke it if necessary, without ever sharing passwords, usernames, etc.
This is a very good suggestion, do you have anything specific in mind? We've been looking at the WAC spec (part of the Solid project), which is related, though different, and interesting.
https://github.com/solid/web-access-control-spec do you have any thoughts on that?
Access control lists are more flexible than User/Group/World-type permissions, but are nowhere near as powerful or composable as capabilities.
[Edit] - Example: On a linux machine, how could you give access to only one file in the whole system? Answer: By setting the permissions on every single file other than the one in question to deny access. Set the permission to allow access on the one file you care to share.
With Capabilities, the token IS the permission... and it doesn't really take much to implement it, once you completely grok the idea.
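A minimal sketch of the idea in Python (illustrative only, not Mojeek's implementation; all names are invented). The token itself is the authorization: the issuer can verify it with a private key and needs no record of who holds it.

    import hashlib
    import hmac
    import secrets

    SERVER_KEY = secrets.token_bytes(32)  # kept private by the issuer

    def issue_capability(feature):
        # Mint a token granting one feature, tied to no identity.
        nonce = secrets.token_hex(8)
        payload = f"{feature}:{nonce}"
        sig = hmac.new(SERVER_KEY, payload.encode(), hashlib.sha256).hexdigest()
        return f"{payload}:{sig}"

    def check_capability(token, feature):
        # Accept any token whose signature is valid for the requested feature.
        try:
            tok_feature, nonce, sig = token.rsplit(":", 2)
        except ValueError:
            return False
        expected = hmac.new(SERVER_KEY, f"{tok_feature}:{nonce}".encode(),
                            hashlib.sha256).hexdigest()
        return tok_feature == feature and hmac.compare_digest(sig, expected)

    token = issue_capability("premium-search")  # hand to a sponsor to pass on
    assert check_capability(token, "premium-search")

Revocation then only needs a denylist of spent or cancelled nonces, still without storing any user information.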
Second this; I would be very happy to pay for better search. Ideally I would hope it's possible for you to become a nonprofit, or at least a B-Corp; I feel a nonprofit search engine merits as much funding as the Internet Archive or Wikipedia.
I just tried a search for a relatively obscure game I like, and I was pleasantly surprised to see lots of links to websites I hadn't seen before. I use DDG and while it's better at this than Google, Mojeek seems to be better at showing more obscure sites. To me, this is a blessing.
Of course, that's just one search, but based on this one search alone I've added Mojeek to my bookmarks and will be using it to find the more off-beat paths on the net now.
Thanks. I bring this up because of the very real geopolitical information/misinformation wars that we seem to find ourselves in today; it is important for the reputation of your company to be very clear about the sources of funding, or any other indications of to whom Mojeek may be beholden. I say these things with the greatest respect, and with a great amount of excitement, so that your company can gain the confidence you will need to grow.
What happens if a certain government authority comes knocking, demanding all logs for a certain time period and IP range?
User IP addresses are not logged, so it wouldn’t be possible for us to provide that information; we simply do not have it.
Have you ever received any requests for information from any authority?
At the time of writing we have not received any requests for information from any authority. If we were to be compelled by a court or similar to give over information, all we would ever be able to hand over would be uncoupled search queries, referral data, and requested pages. We log, on aggregate, where traffic comes from at a country level, but we do not and never have logged or stored individual IP addresses. Any information about browsers used is uncoupled from search queries.
Does being in the UK subject you to surveillance under the Snooper’s Charter?
The Snooper's Charter (or Draft Communications Data Bill) was only a bill, it was never implemented. The similar ‘Investigatory Powers Act’ was passed, but that applies to Communication Service Providers, not us. As a non-tracking search engine which respects privacy, we don't store any IP addresses or other information that would be useful to the UK Government if they were looking to identify individuals and if we were compelled in court to hand it over. This all being said, we encourage all Mojeek users to use a wide variety of anti-surveillance and pro-privacy tools when browsing the Web.
Can you be a private option whilst being resident in a Five Eyes member country?
The Five Eyes is an intelligence alliance comprising Australia, Canada, New Zealand, the United Kingdom, and the United States. As Mojeek does not store IP addresses, there would be very little useful information to hand over to the Five Eyes nations were they to compel us to give them data or to allow them to access to Mojeek’s backend.
First, congrats! Search is super-ambitious, and a new index getting traction could be game-changing for lots of people. It appears that Mojeek is indexing translated documents, which is exciting for unlocking knowledge but leads to a bizarre initial user experience. If indexing translations is what is going on, this is kind of exciting... but it will be challenging to deliver a better user experience than single-language indexes.
Term grouping (treating similar terms like "mortgage" and "home loan" as equivalent) seems bizarre. I did a search for my last name, and it came back with mostly results using the German spelling (the German spelling ends in "l", the French version ends in "le", and the English spelling ends in "el"), which made the results mostly worthless.
Here is a use case for testing: search for “Online Wishlist” and if you can show DreamList.com, or Giftster, or Amazon Wishlist, and other providers of universal online wish list services at the top instead of millions of individual wish lists or retailers, you are easy to use. There are a lot of dead sites out there, and part of the job is weeding those out and keeping the most heavily used ones on top. I would bet that Google Analytics heavily influences rankings for Google (if you have a site and remove Analytics, watch what happens to its search traffic).
Google Analytics allows Google to collect more data, of course. I don't know if there is a "payback" in SEO rankings in that case, but there shouldn't be.
Interesting point about universal lists. Yes, there are a lot of dead sites out there, but there are also a lot of good sites that are buried way down in many search engines, beneath big brands and SEO.
Certainly a great initiative and this is something I'd happily pay for if it were reliable for my use cases.
However, I'm presently seeing very few results from popular sites that have a lot of user content: Github, Steam Workshop, etc.
Searching "ge_tts" (an admittedly very niche project of my own) yields no useful results, whereas the project page is the first result on both Google and Bing.
EDIT: Increasing the weight of fuzzy text matches in search results would also be nice. "threejs" doesn't yield a Github result in at least the first 5 pages, but "three.js" works as expected.
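A guess at what may be happening, sketched in Python (an assumption about the indexer, not knowledge of Mojeek's code): if the tokenizer splits on punctuation, "three.js" indexes as "three" and "js" and never matches the single token "threejs". Also indexing the concatenated form would bridge the two.

    import re

    def tokenize(text):
        # Split on any non-word character, e.g. "three.js" -> ["three", "js"]
        parts = [p for p in re.split(r"[^\w]+", text.lower()) if p]
        if len(parts) > 1:
            parts.append("".join(parts))  # also index "threejs"
        return parts

    assert "threejs" in tokenize("three.js")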
I'm super impressed that you've kept all the infrastructure in-house; I'll definitely be trying this out as my default search for the next few weeks.
One thing I noticed was that searches often return pages in foreign languages, e.g. searching 'Wikipedia' pulls up de.wikipedia.org and es.wikipedia.org as search results in addition to www.wikipedia.org. I'm sure you're looking more into how to best handle page languages, but I figured I'd mention it for the sake of user feedback :)
It's great to see the focus on privacy, but while Mojeek claims that its technology is new and was built from the ground up [1], it seems to fall down some of the same holes as Google. For example:
Good luck to Mojeek, but at some point we'll need to have a search engine that's not dumb as bricks five times a day.
__________________
[1] Mojeek's technology has been developed entirely from scratch by Marc Smith, mostly using the C programming language, and uses no pre-existing search or web crawler technology. All technology and IP is fully owned by Mojeek Limited.
1) I am building a browser [1] for the semantic/static web; would I be allowed to integrate this as a search alternative to searx.me and wiby.me?
2) What's your policy on submitting URLs? Do you support opengraph descriptions? Do you support dublincore (which would be awesome)? Should browser users be allowed to submit things they couldn't find?
3) I couldn't find detailed API docs, only the opensearch description. Do you have a JSON API that would help reduce the load on your web servers? Also, what syntax do you support?
1) Yes you can include it as the default or optional search engine.
2) We used to have an add URL page but a significant proportion of the submissions were spam. We will look into ways genuine submissions could be accepted as lots of people have asked for this.
I know this is a bit late to the party, but I'm not finding on your page any way to build up search queries that are more than just a few terms; that is, I can't seem to provide restrictions within the search query itself (search only within a given site, search for term A AND term B, etc.).
If this isn't implemented, any idea on when, or if, you'll implement it at all? :) Love the idea behind the project, but I'm feeling a bit apprehensive, seeing as a lot of these projects to build more privacy-focused engines only lead to poorer results, with a lot of the pages still having questionable practices.
Awesome! By the way, if you ever monetize using advertisements, might I suggest time-based bidding for keywords, similar to a billboard; that way you really don't have to track anyone to make ad revenue.
Cheers. We will include ads of various types; these are yet to be decided upon, but we are starting in Q1 2021 with a first set of partners/advertisers. These will no doubt evolve, but yes, it will be done without tracking; we won't be passing on IP addresses in the way some other so-called "privacy" engines do. So yes, keyword/search-query based.
Super cool. Very happy you've got some great heads on it. I think time-based ads, with an estimate of how many queries saw the ad and an after-session count of how many people likely saw it (based on whatever stat tracking you do), will be great.
I get the lack of personalized tracking; that makes sense, and I don't want it. But do they at least track, for a given query, which results are clicked or not (not attributed to a person, just to a query string) for relevance evaluation and optimization? For most search engines, this is foundational.
I'd be genuinely curious if they use a different method for this sort of thing.
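For what it's worth, such query-level click statistics can be kept with no user identifier at all; a minimal sketch in Python (purely illustrative, not a claim about Mojeek's internals):

    from collections import Counter, defaultdict

    # Clicks keyed only by (normalized query -> result URL); no IP, no user ID.
    click_counts = defaultdict(Counter)

    def record_click(query, result_url):
        click_counts[query.lower().strip()][result_url] += 1

    def clicks_for(query, result_url):
        return click_counts[query.lower().strip()][result_url]

    record_click("web crawler", "https://example.org/spiders")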
What I'd like is a "this recommendation sucks" button next to queries. I don't care if it's personalized or not; it's just annoying to get 3 pages of lousy results on Google because what you are looking for happened more than 3 months ago, with no way to say "what the heck is this?"
This is great. I've added it to my phone as my default search engine to try out.
One thing I did notice, because I mistyped the search string while adding it manually as a search engine in Firefox: if you hit a URL over HTTP instead of HTTPS, you'll get a 403. It might be smoother just to redirect. I say that having seen someone affiliated with Mojeek in the comments.
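The usual fix is a server-level redirect; a minimal sketch for nginx, assuming that's what sits in front (configuration invented for illustration):

    server {
        listen 80;
        server_name www.mojeek.com mojeek.com;
        # Redirect all plain-HTTP requests to HTTPS instead of returning 403
        return 301 https://$host$request_uri;
    }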
Do you plan to leave your index uncurated, let it self-moderate, or allow reporting of content like spam and link farms?
My opinion is that leaving it uncurated isn't really perfect, but moderation isn't ever perfect either. I wish I had some insight into this. I'm curious.
I've tried a few other cases like those. It seems to me that umlauts are fine when only a single word is used, even quoted, but a quoted string of several words where one has an umlaut leads to the error.
Regarding dates, on top of the default ranking behaviour there are 'since' and 'before' operators to look specifically within a timeframe [0].
There's also an API feature that allows preferring newer documents over older ones (datewr) [1]. It will eventually make its way into a search preference you can set for your own convenience; in the meantime it works by typing it into the URL.
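For illustration, typing it into the URL might look like this (the exact accepted parameter values are an assumption on my part; see the API docs [1] for what is supported):

    https://www.mojeek.com/search?q=web+crawler&datewr=1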
We are also working on the core algorithm to better identify when the main content of a page has changed.
I work on the tech side of Mojeek; thanks for the feedback, it's appreciated.
We have been tweaking our language detection and intend to roll out some changes in the coming months to give heavier preference to the user's preferred language.
I find it funny that searching the web still has the same user experience as in the '90s, while most websites are now apps. This model feels outdated.
Hello, thanks for the honest feedback. There is a link at the bottom of the search results page for the particular query you used; it would be much appreciated if you let us know via that, so we can evaluate where the problem lies.
Hmm, OK, that's two big questions and goes to the core of what benefit we provide, what we do and our values.
We've put up answers to some of these issues, notably these questions:
What data do you collect?
What happens if a certain government authority comes knocking, demanding all logs for a certain time period and IP range?
Have you ever received any requests for information from any authority?
We are exploring working with commercial partners who want data to deliver services (shopping, ads). We will only pass along search queries and location.
We will never provide IP addresses, even though we know some partners pressure for it, or even insist on it.
That works as follows:
1) By default we detect location at the city level, using a geolocation service based on IP address
2) Users can also override that by setting their location (currently country), if they wish. https://www.mojeek.com/preferences?tab=location
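In code terms, the fallback might look something like this Python sketch (function names invented; the real implementation is Mojeek's own):

    def geoip_city(ip_address):
        # Stand-in for a city-level geolocation service lookup (hypothetical)
        return "Brighton, GB"

    def resolve_location(user_preference, ip_address):
        if user_preference:            # 2) an explicit user setting wins
            return user_preference
        return geoip_city(ip_address)  # 1) default: city-level GeoIP;
                                       #    the IP itself is never logged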
It's not the expiry that's an issue. Firefox says:
"Websites prove their identity via certificates. Firefox does not trust this site because it uses a certificate that is not valid for www.mojeek.com. The certificate is only valid for the following names: dock.shp.mcafee.com, *.dock.shp.mcafee.com
Error code: SSL_ERROR_BAD_CERT_DOMAIN"
Could be caused by the extensive blocking (68k lines) in my /etc/hosts file, I suppose.
Phew, we'd not heard of anyone else with this issue but it still made me a little worried when you mentioned the problem. I'm pleased you've got it working now though, thanks for letting us know.
My ISP is using McAfee, which apparently is marking your site as "dangerous". I'm not 100% sure how they are doing this; I assume something during DNS lookup.
Thank you, that's quite worrying; I hope it doesn't affect many people. I don't suppose there's any way of you flagging it as a bug as a customer, is there? (I hope that doesn't sound cheeky!)
I talked to CenturyLink and realized that they install a McAfee "anti-virus" tool on their routers. I disabled this on the router, and the problem went away.
You might want to talk to McAfee about their classifying your site as dangerous, or just curse them out like I've done when they listed my software as malware. For what it's worth, they do the same to emacssurvey.org, so it's not just you :)
> As it happens we were toying with taglines today and minded to go with these:
I very much like the "rediscover" part. The web has indeed lost its charm when the results only show Facebook, Twitter and 10 or so other huge sites.
As for tagline, I'd suggest something along the lines of "Rediscover the web and escape your filter bubble!".
Wish you all the best and happy coding!
P.S. I'd be very curious to know whether you have considered open-sourcing the search engine. This would tie into transparency and trust, and would also enable contributions! I'd love to contribute to my day-to-day search engine, especially if it is privacy-respecting!
Also, I second a subscription model for sustainability. I do not want to be the product.
Perhaps part of your failing is expecting things to appear for words you never searched for. I don't expect the world's accumulated knowledge to forever equate "Face" with a 10-year social fad.
It would be great if there was an option to accommodate those who use search engines as their address bar, while also leaving an alternate mode for those wanting to search for content.
Thanks for the feedback, you are right in that there are no results covering those languages to date.
We are currently focusing on Romance/Latin European language documents as those languages in themselves have a huge amount of content to crawl and index. As our index expands we will increase coverage to index documents in other languages.
I expected this to be a biology story. I love that "web spider" tries to disambiguate, but it utterly failed to for me, because naming web crawlers "spiders" is itself a metaphor for the natural ones.