
“The world's most private search engine” - lemming
https://www.ixquick.com/
======
rdl
I wonder what it takes to host a full search engine for the Internet now, if
an organization wanted genuinely private search. Presumably you can buy crawls
from somewhere, but does anyone sell a self-hostable general web search
engine?

~~~
ChuckMcM
As far as I'm aware there are only three places to get actual search index
results in the US: Google, Bing (via Yahoo BOSS), and of course Blekko. You
can also get index results from Yandex (Russian) and Baidu (Chinese), but
they serve them out of data centers in Amsterdam and Asia respectively, so
latency is a bit longer.

With modern hardware (especially 3TB drives) you can crawl a significant chunk
of the Internet with as few as a hundred machines, but in practice you will
want closer to 300. Of course crawling is but one (relatively small) step in
the series of steps needed to create an index. Next up is page extraction,
then term extraction, then semantic analysis, followed by meta-data decoration
(is it likely porn or maybe malware or junk, did it change since the last time
you looked at it, what is a good snippet, etc.). Then you process link
information and perhaps click-stream data and create your ranking model, which
then lets you build an index. At which point you need another bunch of
machines which can efficiently take a query string and put together a set of
documents that might be a good fit (straightforward for 1 term, harder for 2
terms, exponentially worse for 3, 4, 5, 6 or more terms).
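
The query-serving step at the end is essentially posting-list intersection:
each term maps to a sorted list of document IDs, and a multi-term query
intersects those lists. A minimal sketch in Python (toy in-memory index; in a
real engine the postings are sharded across many machines):

```python
# Toy inverted index: term -> sorted list of document IDs (postings).
# Illustrative data only; real postings live sharded across machines.
index = {
    "private": [1, 3, 5, 8, 13],
    "search":  [2, 3, 5, 8, 21],
    "engine":  [3, 5, 8, 34],
}

def intersect(a, b):
    """Merge-intersect two sorted posting lists in O(len(a) + len(b))."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def query(terms):
    # Intersect shortest list first to keep intermediate results small.
    lists = sorted((index.get(t, []) for t in terms), key=len)
    result = lists[0]
    for lst in lists[1:]:
        result = intersect(result, lst)
    return result

print(query(["private", "search", "engine"]))  # candidate doc IDs
```

Intersecting shortest-first keeps the intermediate results small, which is one
reason each extra query term adds real work on the serving side.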

Depending on latency requirements and load, that part of the system takes
another few hundred machines.

Search results also follow an inverse power law: the most popular consumer
results fit in a 0.5 billion URL index, the 95th percentile in a 3-5 billion
URL index, and the 99th percentile probably needs closer to a 45-50 billion
URL index. At the 95th percentile you're up to a few thousand machines to
hold it all.
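
Those percentiles can be turned into a rough machine count with
back-of-envelope arithmetic. The per-URL footprint, RAM per machine, and
replication factor below are illustrative assumptions, not Blekko's real
figures:

```python
# Back-of-envelope: machines needed to keep an index hot in RAM.
# Every constant here is an illustrative guess, not a production number.
GB = 1024 ** 3
KB = 1024

bytes_per_url = 10 * KB    # assumed per-URL index + metadata footprint
ram_per_machine = 64 * GB  # assumed usable RAM per serving machine
replication = 3            # copies for redundancy and query fan-out

def serving_machines(urls):
    total = urls * bytes_per_url * replication
    return -(-total // ram_per_machine)  # ceiling division

for urls in (int(0.5e9), int(5e9), int(50e9)):
    print(f"{urls / 1e9:4.1f}B URLs -> ~{serving_machines(urls)} machines")
```

With these guesses a 5 billion URL index lands in the low thousands of
serving machines, consistent with the "few thousand" estimate for the 95th
percentile.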

It isn't really possible to build this on a generic VPS yet (like EC2 on
steroids) without latency becoming ridiculous. It would also be expensive:
the last time I did the calculation for the memory/compute/storage resources
(pretending the latency could be fixed) it was about $2.5M/month in resource
fees for a 1/2 billion page index and crawl. Since that time both Amazon and
Google have lowered some prices, so it may be more affordable now.

Putting it in a data center is much cheaper than that but not quite cheap
enough to just let it sit around without monetizing it somewhat.

I'm a bit surprised that someone hasn't put something together; there are bits
and pieces out there. But even self-hosted you're looking at at least a few
hundred thousand dollars a month to keep it up, and since Yahoo will sell you
access to their search API for a couple of bucks per thousand queries, it's
probably not worth it for most people to try to run their own.

~~~
rdl
I guess it would make more sense to let an organization specify sets of
interest. Optionally fail-through to external public search engines if
necessary.

e.g. if I were an intel agency, I'd crawl all the jihadi sites and build a
special translation/search tool for those (I'm sure they've done this). No
need to crawl the rest of the pornosphere, so it could be done on one box. A
commercial company could search all their industry sites locally, and then
fail-through for general stuff.

I guess my main question is "how many machines are needed to do 1-3 term
search on a 0.005-0.5 billion URL index for _one_ user with decent latency,
and how many simultaneous users could be served from that, after the index is
delivered?" The idea being that organizations will trust the security
guarantees of a machine they own, so you could build a search service in a way
that the crawl etc. is done by one set of shared machines, bulk data is
streamed to everyone inside the datacenter, certain data (up to the entirety)
is retained by each end-user organization's query cluster, and then queries
are done by that cluster locally to keep search terms secret.

I assume the actual query part could fit on single machines -- you would sort
of want to keep cached pages for a lot of the things you'd want to keep
searches secret for, but disk is cheap.

~~~
ChuckMcM
In that particular case it gets a lot easier, of course. One of the challenges
of "general" search is that you're trying to satisfy a wide variety of
queries; for domain-specific search you can be much more selective about what
you pick. Also, one of the 'knobs', if you will, is the rate of change of the
corpus. If you index Wikipedia, for example, it doesn't change all that much
and that is a pretty stable index, versus if you index, say, a few tens of
thousands of blogs, they will get new articles to index once every 0.5 to
maybe 3 days, versus keeping an eye on the "news" sites, which get perhaps a
dozen articles a day added to them.
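
That change-rate knob translates directly into a recrawl schedule. A toy
priority-queue scheduler in Python (the intervals are illustrative, and a real
scheduler would adapt them from observed change history per page):

```python
import heapq

# Illustrative recrawl intervals (hours) by how fast a corpus changes.
RECRAWL_HOURS = {
    "encyclopedia": 24 * 30,  # Wikipedia-like: very stable
    "blog": 24,               # new posts every ~0.5-3 days
    "news": 1,                # a dozen new articles a day
}

def schedule(sites, horizon_hours):
    """Return (hour, url) fetches due within the horizon, in order.

    `sites` maps url -> corpus kind. A min-heap keyed on next-due time
    lets the crawler always grab the most overdue page next.
    """
    heap = [(RECRAWL_HOURS[kind], url) for url, kind in sites.items()]
    heapq.heapify(heap)
    fetches = []
    while heap and heap[0][0] <= horizon_hours:
        t, url = heapq.heappop(heap)
        fetches.append((t, url))
        heapq.heappush(heap, (t + RECRAWL_HOURS[sites[url]], url))
    return fetches
```

Over a 24-hour horizon this visits a "news" site every hour but touches a
"blog" site only once, which is the stability difference described above.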

Then if you don't keep the original extractions around, just holding the index
can be done on a relatively small number of machines (say 5 or 10), and for a
small user population that is pretty doable.

Google sold its "Google Search Appliance," which did this for businesses, and
often that was less than one cabinet or rack's worth of machines.

Given some of the silly things people try to scrape Blekko for (clearly
domain-specific things) I'm surprised there aren't more folks building such
engines. It isn't hard to write a basic crawler, but it takes a lot of coaxing
to get it through the initial crawl, since if you just blindly follow links
around you can end up in endless loops on some sites or in a quasi
form-submission loop (we call those crawler traps; for the most part they're
unintentional on the part of web sites [1]).
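
The guardrails that keep a blind link-follower out of those traps are simple
to sketch: a seen-set, a depth limit, and a per-host page cap. A toy
breadth-first crawler in Python (the `fetch` callback is a hypothetical
stand-in for the real HTTP fetch plus link extraction; a production crawler
also needs robots.txt handling, politeness delays, and retries):

```python
from collections import deque
from urllib.parse import urlsplit

def crawl(seed, fetch, max_depth=5, max_per_host=1000):
    """Breadth-first crawl with simple trap guards.

    `fetch(url)` is assumed to return the list of links found on the page.
    The seen-set stops revisits, the depth limit bounds link chains, and
    the per-host cap cuts off endless ?page=N style loops.
    """
    seen = {seed}
    per_host = {}
    queue = deque([(seed, 0)])
    visited = []
    while queue:
        url, depth = queue.popleft()
        host = urlsplit(url).netloc
        if per_host.get(host, 0) >= max_per_host:
            continue  # likely a crawler trap: too many pages on one host
        per_host[host] = per_host.get(host, 0) + 1
        visited.append(url)
        if depth == max_depth:
            continue
        for link in fetch(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return visited
```

Feeding it a fake `fetch` that endlessly generates "next page" links on one
host shows the per-host cap kicking in where a naive crawler would loop
forever.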

And of course we have been supporting the Common Crawl efforts to make this
more generally available. We gave them a bunch of indexing data that made the
Crawl they had a lot more useful as an example. So at some point we may reach
critical mass and get that going.

[1] Sometimes though folks use an intentional crawler trap to find 'stealth'
scrapers, which are folks making their own crawlers and not respecting
robots.txt.

------
crdoconnor
I've noticed while using this that the results are often very different from
Google's (and generally inferior).

Sometimes I find myself switching back to "regular" Google, but not as often
as when I used DuckDuckGo.

One thing that really bugs me is that I can't do currency conversions or
calculations in this one.

~~~
stinos
This, plus the lack of completion in the search box.

------
qwertzlcoatl
I usually use DuckDuckGo.com, because they have a neat toolbar and I want to
support their cause, but every time I don't get the results I want through DDG
(which happens rarely) I just type !sp + searchterm into my search bar and it
searches via startpage.com, which uses Google's search results without my IP
address or searches being recorded; no identifying or tracking cookies are
used, and SSL encryption is on by default.

Startpage.com was launched in 2009 by ixquick to provide the same service via
a URL that is both easier to remember and spell.

~~~
Sami_Lehtinen
Wrong, it is not the same service.

~~~
drivebyacct2
>[https://startpage.com/eng/aboutstartpage/](https://startpage.com/eng/aboutstartpage/)

Start page by ixquick.

~~~
Sami_Lehtinen
Nice quote, but you completely missed the point. How about just comparing the
results? Then it's very clear it's not the same service as ixquick. It's the
same company, but not the same service.

~~~
drivebyacct2
My mistake, I genuinely missed the "enhanced with Google" part.

------
twapi
Scroogle was a similar and quite popular site, but Google blocked it.

~~~
wielebny
Context:
[http://en.wikipedia.org/wiki/Criticism_of_Google#Scroogle](http://en.wikipedia.org/wiki/Criticism_of_Google#Scroogle)

------
electrichead
Does that mean it searches private info? Or does it mean that nobody uses it?
Maybe it means that the searches are anonymized?

~~~
andrewfhart
The last. It basically acts as a proxy: You type in your query, and IxQuick
(aka startpage.com) queries Google on your behalf, thereby shielding you from
the associated analytics and tracking. It also claims to not store the IP
address associated with an individual query.

Edit: more info here:
[https://ixquick.com/eng/aboutixquick/](https://ixquick.com/eng/aboutixquick/)

------
Sami_Lehtinen
[http://startpage.com](http://startpage.com) also serves Google results.

