
Mojeek – 1.5B Pages and Growing (2016) - Sami_Lehtinen
https://blog.mojeek.com/2016/12/1500000000-pages-and-growing.html
======
kbd
Visited their blog, typed a simple search for "python"[1] into their search
box, and got "Sorry your search appeared automated and was blocked."

[https://www.mojeek.com/search?q=python&site=](https://www.mojeek.com/search?q=python&site=)

~~~
mojeek
Hi, I'm Marc, the developer of Mojeek. Sorry about this; I'm aware it's
happening too often and will look into it. If you press refresh twice without
doing anything else, that should fix it.

~~~
visualdensity
Also getting the automated error message. Sorry but it's practically useless
at the moment :/

~~~
mojeek
Did you reload the error message page twice? That should fix it; if not,
please try searching from the homepage and let me know if neither works, thanks.

------
i336_
Have some amusement courtesy of
[https://www.mojeek.com/preferences](https://www.mojeek.com/preferences):

    
    
      Crawl date: [x] no (default)  [ ] yes  [ ] no
    

(It was in amongst a bunch of other settings that also had those same 3
options)

As can be seen by simply mentally timing the delay, and also by the "in
0.47s", queries seem to be cached. That sort of admits that the cache
isn't _that_ up-to-date.

Also, I tried my standard "bamboozle google" query, "ketmax", while looking
for results suffixed with "35", "." and "zip". No dice. (ixquick eventually
found it that one time.)

~~~
mojeek
Hi, thanks for the feedback.

I'll most likely be changing the preferences back to a more traditional yes/no
soon! The extra option was meant as a "clear" setting: use whatever the
default is for future searches (which may change).

Yes the results will be cached for up to 1 hour which is why subsequent
searches didn't take as long.

I'm limited to how much I can index, so ZIP files have been skipped for now.

~~~
i336_
Wow, cool to see the author checking in! :)

Thanks for clarifying the purpose of the "clear" preference. There's no
description underneath that option (and only that option, making it seem like
an outlier that was supposed to be removed or something), so it was far too
easy to assume there was a glitch somewhere, e.g. a template fault (that made
that option get the same set of radio buttons as the three elements underneath
it).

Ah, the cache lasts an hour. That's only four times longer than Google's
turnaround time (they aim to have data from most sites into search results
pages within 15 minutes), so all things considered that's not shabby at all.
(I'm not saying "only" here sarcastically)

I'm curious as to the indexing limitation, which I presume directly translates
to current storage capacity. This may ( _may_ ) be interesting: GitLab
recently asked Hacker News for advice about a systems upgrade
[https://news.ycombinator.com/item?id=13153031](https://news.ycombinator.com/item?id=13153031)
and received many, many comments. This forum is no replacement for the
existing partnerships you've forged over the last 13 years but it's possible
that networking on here _may_ be interesting. Not presuming; just putting a
datapoint out there. (The main caveat is that GitLab is within the collective
on here. I can't see why Mojeek couldn't be too though.)

Now for a bunch of other things.

I'm very curious how you're doing text analysis for the search entries.

How are you figuring out the (full-resolution) result counts (like
"18,342,669" - not "16.8 million") for the "from $resultcount in $time" bit?
This is something Google continually gets wrong!

Most of my interest stems from being impressed that you're using C for
all of this (which absolutely makes sense; it's cheaper to run!); my curiosity
is specifically about how you're doing those things from C, how you're
managing the inevitable crashes, and so on.

I'm also curious what database you're using (and also the query parsing
architecture) - because searching is actually pretty fast (even from
Australia!).

Feel free to answer whatever you like here (highly appreciated!), but what
would be _really_ cool would be a multi-page writeup explaining everything in-
depth (basically everything that's non-competitive) and then posting the URL
as a new article here (it might take a couple goes till it "sticks" and people
see and upvote it). That would be very interesting to a lot of people on here,
I think.

Just a thought.

~~~
mojeek
Thanks, really appreciate the feedback! HN is actually my most frequented
site, I just don't talk much!

I did see that post at the time but will read it again. Index limitation is
basically just lack of servers, funds, etc.

Currently we're doing a full index search for every query, and so we know the
exact number of hits within our index. This might change in the future,
though, and become "about".

No databases except for what I've written myself. It started as a hobby so
everything was from scratch from the beginning and just continued that way.

Yes thanks that's a very good idea. I'll try and put something together.

Marc

~~~
i336_
Thanks for piping up about this!

The article _may_ be interesting to go through but most of the comments were
discussing fairly stock-standard "popular"/hyped software stacks, so YMMV with
that specific data. Untangling all of the individual pieces into a coherent
picture was also fairly involved (I gave up). I mention the link solely to
reference the _idea_ that GitLab did get lots of hits and feedback, and that
you _may_ find it an interesting idea to consider networking on here regarding
server resources. It's just a thought, may not be useful.

Okay, so... a small continuous load of users is getting a full index search
over 1 billion items within 500ms per request. That's... that needs to go into
your writeup, along with what your current usage load is like. Prepare for
inquiries and offers when you do your post!

Also, you're definitely going to have a _lot_ of interest in the database
system if it's homemade; your use case (high read and query load, moderate
write load) is fairly widespread, and different implementations always lend
themselves to being super-awesome at certain kinds of queries.

I'm not going to push the open-source idea myself, but you'll definitely have
a bit of clamouring. That will need to be worked out; if this site is your
most frequently read (cool) then you probably already have a good idea of the
pros and cons of open vs closed.

Very much looking forward to hearing about this, whenever it happens. Doesn't
need to be immediate by any means - comprehensive, in-depth analysis takes a
while, and if it's not rushed the results are very good.

~~~
mojeek
It's a very good idea and I definitely should do it anyway.

Most of the 500ms is spent generating the snippet; the search itself is
usually much quicker. The search is affected by load more than snippet
generation is, though. I'll try and get it all written down.

Thanks very much!

------
machopacho
[https://www.mojeek.com/search?q=ycombinator+news](https://www.mojeek.com/search?q=ycombinator+news)
vs.
[https://www.google.de/search?q=ycombinator+news](https://www.google.de/search?q=ycombinator+news)

Google's first result is on target; theirs is not. A full index is not
everything; your search algorithm needs to be on top as well.

~~~
mojeek
Hi, if you click the more results from news.ycombinator.com it is there but
for some reason the submit page is pushing it down a place. I'll look into it,
thanks.

------
zitterbewegung
Tried it out by searching for Snapchat in the news results. All of the top
results are from last week, but today's news of their 12% drop is buried in
the results. Wouldn't newer news be more relevant?
[https://www.mojeek.com/search?q=snapchat&fmt=news&news=1](https://www.mojeek.com/search?q=snapchat&fmt=news&news=1)

------
lacksconfidence
What is being used to tune the search results, if user tracking isn't
involved? I'm of course guessing here, but I would assume no tracking means no
log of queries and no log of clicked results?

------
lqdc13
looks like it's worse than duckduckgo for programming-related searches.

~~~
nerdponx
This is the second "alt search engine" submission I've seen today. Normally
I'm in favor of healthy competition in the "open/non-evil" products space. But
search engines, probably more than any other web-based application, benefit
from economies of scale.

So why bother starting another one when DuckDuckGo is already gaining mind and
market share? It just seems wasteful to re-index the Web and develop
algorithms for searching it when someone else is already doing it.

~~~
mojeek
Hi, I started developing Mojeek as a hobby project way before DuckDuckGo
existed. We also have our own search index, unlike DDG; I've only more
recently had the funds to start growing it.

------
est
Any search containing non-ASCII characters redirects to the home page.

------
alewis75
Great to see an English search engine with no tracking and independent
results!

------
j_s
Anyone have any luck with the CloudFlare leak search parameters?

------
sud0x3
Why would I use this over duckduckgo?

------
kevin2r
what are the back-end technologies used to make mojeek.com work?

~~~
mojeek
Hi, it's been developed from scratch in C by myself.

------
simplehuman
This was posted in Dec 2016

~~~
sctb
Thanks, we've updated the submission title.

