Hacker News

Common Crawl is awesome. I wonder how complex it would be to run a Google-like frontend on top of it, and how good the results would be after a couple days of hacking...

Very, and probably not very good (compare Gigablast to Google for an example of why it's hard). Not to take anything away from Common Crawl, but crawling is often one of the easier things to build when creating a search engine. A crawler can be as simple as

for(listofurls) { geturl; add urls to listofurls; }

Doing it on a large scale over and over is a harder problem (which Common Crawl solves for you), but it's not too difficult until you hit scale or want realtime crawling.
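The loop sketched above can be made concrete. This is a toy sketch only: it crawls an in-memory fake "web" (a dict I made up for illustration) instead of doing real HTTP fetches, but the structure, a frontier queue plus a seen-set, is the same one a real crawler needs to avoid revisiting pages.

```python
from collections import deque

# Toy in-memory "web": page -> links it contains. A real crawler would
# fetch each URL over HTTP and parse links out of the HTML.
FAKE_WEB = {
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],  # cycle: this is why the `seen` set is essential
}

def crawl(seed):
    """Breadth-first crawl from `seed`; returns pages in visit order."""
    seen, queue, order = {seed}, deque([seed]), []
    while queue:
        url = queue.popleft()
        order.append(url)                   # "geturl"
        for link in FAKE_WEB.get(url, []):  # "add urls to listofurls"
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order
```

Without the `seen` set, the a.com -> c.com -> a.com cycle would loop forever, which is exactly the kind of trap real sites set.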

Building an index on 210 TB of data, however, is another matter. Assuming you use Sphinx/Solr/Gigablast, you are going to need about 50 machines to deal with this amount of data with any sort of redundancy. That's just to hold a basic index, not including "pagerank" or anything (Gigablast is a web engine, so it might have that built in, I'm not sure). That doesn't factor in adding rankers to make it a web search engine, spam/porn detection, and all the other stuff that goes with it. Then you get into serving results: unless your indexes are in RAM, you are going to have a pretty slow search engine, so add a lot more machines to hold the index for common terms in memory.
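To make "a basic index" concrete, here is a toy sketch of the core data structure involved, an inverted index mapping each term to the set of documents containing it. Real engines like Sphinx, Solr, or Gigablast shard this across machines and compress the posting lists heavily; this shows only the idea, with made-up example documents.

```python
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: text}. Returns term -> set of doc ids (an inverted index)."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """AND-query: ids of docs containing every query term."""
    postings = [index.get(t, set()) for t in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {1: "common crawl data", 2: "crawl the web", 3: "web search engine"}
idx = build_index(docs)
```

The scaling problem in the comment above is that at 210 TB these posting lists no longer fit on one machine, so they must be partitioned and, for common terms, pinned in RAM to keep query latency acceptable.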

If someone is keen to do this, however, here is a list of articles/blogs which should get you started (I wrote this originally as an HN comment which got a lot of attention, so I made it into a blog post): http://www.boyter.org/2013/01/want-to-write-a-search-engine-...

Actually, it's not so simple. Sure, you can do simple crawling easily; the hard part is extracting meaningful data from it. It's very easy to get stuck in loops on many sites, for instance. Protocol violations abound: some sites serve binaries as text/html.
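As a minimal sketch of defending against that last problem (binaries served as text/html), a crawler can sniff the first bytes of the body instead of trusting the Content-Type header. The NUL-byte / UTF-8 heuristic below is one simple assumption, not a complete content-sniffing algorithm:

```python
def looks_binary(payload: bytes, sample_size: int = 1024) -> bool:
    """Heuristic check: a NUL byte or undecodable UTF-8 in the first
    chunk suggests the body is binary, whatever Content-Type claimed."""
    sample = payload[:sample_size]
    if b"\x00" in sample:
        return True
    try:
        sample.decode("utf-8")
        return False
    except UnicodeDecodeError:
        # Note: truncating mid-character can cause rare false positives;
        # a production sniffer would handle that and other encodings.
        return True
```

Browsers and serious crawlers use much more elaborate byte-pattern sniffing, but even this level of checking catches the common case of a PDF or image mislabeled as HTML.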

What I heard from a smaller search engine is that web crawling is usually augmented with manually added rules for various sites to keep them from polluting the database. Not a trivial task at all.

Doing queries is, IMHO, algorithmically much better understood, because it's a constrained problem. But extracting information from the real world, with all the PHP and HTML "hackers" out there, is not so easy.

Which is one of the main reasons Google has no serious competition in search except possibly in China.

It is also why the rate of innovation in search isn't moving as fast as it can be moving.

If Google opened up (unlimited) web API access to their search interface to, say, a large city for a year or two, people would really get a taste of what innovation in search looks like.

And of course it would be in Google's interest, because search as a platform or marketplace is where Google's future really lies. All the other advertising-empire-defending distractions like Android, Chrome and YouTube are really sideshows.

Personally I consider extracting meaning from the crawl to be part of the indexing step. That just comes down to how you define it, though. In reality it's blurry, as you need to do some pre-indexing during the crawl to extract meaningful data, and as you say there are a lot of edge cases.

For basic crawling, though, it really is as simple as "while there are links, download the next link".

The massive advantages that Google has include over a decade of data on the pages that people actually visited in response to a specific query as well as having an in-memory index of the public web, parts of which are updated on the order of seconds to minutes.

I wonder if there is a viable business in maintaining an in-memory & up-to-date index of the public web & selling access to it, with a pricing model that scales according to the amount of computation you are doing on it.

blekko has data customers paying us for that kind of thing.

It would be challenging. You've got a crawl, but one with a fair bit of spam in it, despite the donation of blekko metadata. Then you have to figure out ranking for keywords, something that the blekko metadata won't help you with at all.

Not very, or else someone would have replicated Google already (with fewer engineers and less money than Microsoft has thrown at the problem).

This is actually a very small corpus, just 3 billion docs. Google, for example, is known to have 50 billion docs in its index.
