
As far as I'm aware there are only three places to get actual search index results in the US: Google, Bing (via Yahoo BOSS), and of course Blekko. You can also get index results from Yandex (Russian) and Baidu (Chinese), but they serve them out of data centers in Amsterdam and Asia respectively, so latency is a bit longer.

With modern hardware (especially 3TB drives) you can crawl a significant chunk of the Internet with as few as a hundred machines, but in practice you will want closer to 300. Of course crawling is but one (relatively small) step in the series of steps needed to create an index. Next up is page extraction, then term extraction, then semantic analysis, followed by meta-data decoration (is it likely porn or maybe malware or junk, did it change since the last time you looked at it, what is a good snippet, etc.). Then you process link information and perhaps click stream data and create your ranking model, which then lets you build an index. At that point you need another bunch of machines which can efficiently take a query string and put together a set of documents that might be a good fit (straightforward for 1 term, harder for 2 terms, exponentially worse for 3, 4, 5, 6 or more terms).

Depending on latency requirements and load, you build that part of the system out of another few hundred machines.
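To make the query step above concrete, here is a minimal single-machine sketch of an inverted index with multi-term lookup. It only does conjunctive matching (every term must appear) by intersecting sorted posting lists, with no ranking, sharding, or snippets, and it is not meant to describe how any real engine does it. The names `build_index`, `intersect`, `search`, and the sample documents are all made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to a sorted list of document ids (a posting list)."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def intersect(a, b):
    """Merge-intersect two sorted posting lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    lists = [index.get(term, []) for term in query.lower().split()]
    if not lists:
        return []
    lists.sort(key=len)  # start from the rarest term to keep work small
    result = lists[0]
    for plist in lists[1:]:
        result = intersect(result, plist)
        if not result:
            break
    return result


# Hypothetical toy corpus
DOCS = {
    1: "open source web crawler",
    2: "web search index",
    3: "search index ranking model",
}
INDEX = build_index(DOCS)
print(search(INDEX, "search index"))
```

Each extra term adds another posting list to fetch and intersect, and at real scale those lists live on different machines, which is roughly where the extra query-serving hardware goes.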

Search results though also have an inverse power law associated with them: the most popular consumer results fit in a 0.5 billion URL index, the 95th percentile in a 3 - 5 billion URL index, and the 99th percentile is probably closer to a 45 - 50 billion URL index. At the 95th percentile you're up to a few thousand machines to hold it all.

It isn't really possible to build on a generic VPS yet (like ECS on steroids) without the latency being ridiculous. It would also be expensive: the last time I did the calculation for the memory/compute/storage resources (pretending the latency could be fixed) it was about $2.5M/month in resource fees for a 1/2 billion page index and crawl. Since that time both Amazon and Google have lowered some prices so it may be more affordable now.

Putting it in a data center is much cheaper than that but not quite cheap enough to just let it sit around without monetizing it somewhat.

I'm a bit surprised that someone hasn't put something together; there are bits and pieces out there. But even self-hosted you're looking at at least a few hundred thousand dollars a month to keep it up, and since Yahoo will sell you access to their search API for a couple of bucks per thousand queries it's probably not worth it for most people to try and run their own.
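A rough back-of-the-envelope version of that trade-off, as a sketch. The two inputs are assumed readings of the figures above: "a few hundred thousand" is taken as $300,000 a month and "a couple of bucks per thousand queries" as $2.00. Neither is a quoted price, and the function name is made up.

```python
def break_even_queries_per_month(self_host_cost_per_month, api_price_per_thousand):
    """Monthly query volume at which API fees equal the self-hosting bill."""
    return self_host_cost_per_month / api_price_per_thousand * 1000


# Assumed values, not quoted prices
SELF_HOST_COST = 300_000        # dollars per month
API_PRICE_PER_THOUSAND = 2.0    # dollars per 1,000 queries

BREAK_EVEN = break_even_queries_per_month(SELF_HOST_COST, API_PRICE_PER_THOUSAND)
print(f"{BREAK_EVEN:,.0f} queries/month")
```

With those assumed numbers the break-even comes out to 150 million queries a month, or roughly 58 queries per second sustained over a 30-day month. Below that volume the API is the cheaper option.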

A few years ago, before JumpTap pivoted to be a mobile ad network, they built a search engine using off-the-shelf parts on a hosting service. Some of the founding team members were from Lycos, and they had a pretty good idea of how to do this.

It was not run at general-purpose production scale, but they did show they could produce credible search results at a relatively small scale.

There are several open source spiders out there. As Web search portals increasingly aim at the broad middle, I sometimes wonder if doing your own domain-specific spidering might find interesting things.

I guess it would make more sense to let an organization specify sets of interest. Optionally fail-through to external public search engines if necessary.

e.g. if I were an intel agency, I'd crawl all the jihadi sites and build a special translation/search tool for those (I'm sure they've done this). No need to crawl the rest of the pornosphere, so it could be done on one box. A commercial company could search all their industry sites locally, and then fail-through for general stuff.
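A minimal sketch of the fail-through idea: answer from the local, domain-specific index when it has hits, and only hand the query to an outside engine when it doesn't. Everything here is hypothetical: the index is a plain dict of term to URLs, and `fake_external` is a stand-in for whatever public search API would actually be called.

```python
def make_fail_through_search(local_index, external_search, min_results=1):
    """Build a search function that prefers the local index over the external one."""

    def search(query):
        terms = query.lower().split()
        if terms:
            # Conjunctive match: keep URLs that appear under every term
            url_sets = [set(local_index.get(term, [])) for term in terms]
            hits = sorted(set.intersection(*url_sets))
        else:
            hits = []
        if len(hits) >= min_results:
            return "local", hits
        # Not enough local hits: fail through to the external engine
        return "external", external_search(query)

    return search


# Hypothetical local index for an industry-specific crawl
LOCAL_INDEX = {
    "valve": ["https://example.com/valves"],
    "pump": ["https://example.com/pumps", "https://example.com/valves"],
}


def fake_external(query):
    # Stand-in for a call to a public search API
    return [f"external result for {query!r}"]


search = make_fail_through_search(LOCAL_INDEX, fake_external)
print(search("pump valve"))
print(search("weather"))
```

Note that any query that falls through is visible to the external engine, so the fallback is the part to turn off when the search terms themselves are sensitive.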

I guess my main question is "how many machines are needed to do 1-3 term search on a 0.005-0.5 billion URL index for one user with decent latency, and then how many simultaneous users could be served from that, after the index is delivered?" The idea being that organizations will trust the security guarantees of a machine they own, so you could build a search service in a way that the crawl/etc. is done by one set of shared machines, bulk data streamed to everyone inside the datacenter, certain data (up to the entirety) retained by each end-user organization's query cluster, and then queries done by that cluster locally to keep search terms secret.

I assume the actual query part could fit on single machines -- you would sort of want to keep cached pages for a lot of the things you'd want to keep searches secret for, but disk is cheap.

In that particular case it gets a lot easier of course. One of the challenges of "general" search is that you're trying to satisfy a wide variety of queries; for domain specific search you can be much more selective about what you pick. Another one of the 'knobs', if you will, is the rate of change in the corpus. If you index Wikipedia, for example, it doesn't change all that much and that is a pretty stable index, whereas if you index, say, a few tens of thousands of blogs, they will get new articles to index once every 0.5 to maybe 3 days, and the "news" sites you keep an eye on get perhaps a dozen articles a day added to them.
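One way to picture the change-rate knob is as a recrawl schedule where each source gets its own revisit interval. The sketch below uses a priority queue keyed on the next due time. The intervals are assumptions loosely based on the rates mentioned above (news every 2 hours for about a dozen articles a day, blogs daily, a stable encyclopedia-style site monthly), and `recrawl_order` is a made-up name.

```python
import heapq


def recrawl_order(intervals_hours, horizon_hours):
    """Return (hour, source) fetch events up to horizon_hours, earliest first."""
    # Every source is due immediately at hour 0
    heap = [(0, name) for name in sorted(intervals_hours)]
    heapq.heapify(heap)
    events = []
    while heap and heap[0][0] <= horizon_hours:
        hour, name = heapq.heappop(heap)
        events.append((hour, name))
        # Schedule the next visit one interval later
        heapq.heappush(heap, (hour + intervals_hours[name], name))
    return events


# Assumed revisit intervals, in hours
INTERVALS = {"encyclopedia": 720, "blog": 24, "news": 2}

EVENTS = recrawl_order(INTERVALS, horizon_hours=24)
for hour, name in EVENTS:
    print(hour, name)
```

Over one day the news source is fetched 13 times, the blog twice, and the stable site once, which is why a slow-changing corpus is so much cheaper to keep fresh.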

Then if you don't keep the original extractions around, just holding the index can be done on a relatively small number of machines (say 5 or 10) and for a small user population that is pretty doable.

Google sold its Google Search Appliance, which did this for businesses, and often that was less than 1 cabinet or rack worth of machines.

Given some of the silly things people try to scrape Blekko for (clearly domain specific things) I'm surprised there aren't more folks building such engines. It isn't hard to write a basic crawler, but it takes a lot of coaching to get it through the initial crawl, since if you just blindly follow links around you can end up in endless loops on some sites or in a quasi form submission loop (we call those crawler traps, for the most part unintentional on the part of web sites [1]).
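A minimal sketch of the usual guards against that kind of loop: remember what has been visited, strip URL fragments, cap the link depth, and cap the number of pages taken from any one host. To keep it runnable without a network, links come from a `get_links` callback, and `fake_site` below is a hypothetical site with a calendar-style trap that always links to one more "next day" page. None of this is a description of Blekko's crawler.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit


def crawl(seed, get_links, max_depth=3, max_pages_per_host=100):
    """Breadth-first crawl with simple guards against crawler traps."""
    seen = set()
    per_host = {}
    order = []
    queue = deque([(seed, 0)])
    while queue:
        url, depth = queue.popleft()
        url, _fragment = urldefrag(url)  # "#top" variants are the same page
        host = urlsplit(url).netloc
        if url in seen or depth > max_depth:
            continue
        if per_host.get(host, 0) >= max_pages_per_host:
            continue  # this host has used up its page budget
        seen.add(url)
        per_host[host] = per_host.get(host, 0) + 1
        order.append(url)
        for link in get_links(url):
            queue.append((link, depth + 1))
    return order


def fake_site(url):
    # Hypothetical trap: every calendar page links to a brand-new URL
    if "cal?day=" in url:
        day = int(url.rsplit("=", 1)[1])
        return [f"http://trap.example/cal?day={day + 1}"]
    if url == "http://site.example/":
        return [
            "http://site.example/a",
            "http://site.example/#top",
            "http://trap.example/cal?day=1",
        ]
    if url == "http://site.example/a":
        return ["http://site.example/"]  # link cycle back to the seed
    return []


PAGES = crawl("http://site.example/", fake_site, max_depth=50, max_pages_per_host=5)
print(len(PAGES), "pages fetched")
```

The visited set handles the plain link cycle, while the per-host budget is what stops the calendar, since each of its pages has a URL the crawler has never seen before.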

And of course we have been supporting the Common Crawl efforts to make this more generally available. As an example, we gave them a bunch of indexing data that made the crawl they had a lot more useful. So at some point we may reach critical mass and get that going.

[1] Sometimes though folks use an intentional crawler trap to find 'stealth' scrapers, which are folks making their own crawlers and not respecting robots.txt.
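For the polite side of that footnote, here is a small sketch of checking robots.txt before fetching, using Python's standard `urllib.robotparser`. The robots.txt content, the `examplebot` user agent, and the URLs are made up, and the rules are parsed from a string rather than downloaded so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; a site might list a trap path here on purpose
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_checker(robots_txt, user_agent):
    """Return a function that says whether user_agent may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return lambda url: parser.can_fetch(user_agent, url)


allowed = make_checker(ROBOTS_TXT, "examplebot")
print(allowed("http://site.example/index.html"))
print(allowed("http://site.example/private/report.html"))
```

A crawler that runs this check before every fetch never touches the disallowed path, which is exactly what lets a site owner treat any request for it as a sign of a scraper that ignores robots.txt.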
