Hacker News new | past | comments | ask | show | jobs | submit login

Crawling is tricky but it's been commoditized. CommonCrawl does it for free for you. If you need pages that aren't in the index then you need to deal with all the crawling issues, but its index is about as big as the one most Google research was done on when I was there.

$50 gets you basically a Hadoop job that can run a regular expression over the plain text in a reasonably-efficient programing language (I tested with both Kotlin and Rust and they were in that ballpark). $800 was for a custom MapReduce I wrote that did something moderately complex - it would look at an arbitrary website, determine if it was a forum page, and then develop a strategy for extracting parsed & dated posts from the page and crawling it in the future.

A straight inverted index (where you tokenize the plaintext and store a posting list of documents for each term) would likely be more towards the $50 end of the spectrum - this is a classic information retrieval exercise that's both pretty easy to program (you can do it in a half day or so) and not very computationally intensive. It's also pretty useless for a real consumer search engine - there's a reason Google replaced all the keyword-based search engines we used in the 80s. There's also no reason you would do it today, when you have open-source products like ElasticSearch that'd do it for you and have a lot more linguistic smarts built in. (Straight ElasticSearch with no ranking tweaks is also nowhere near as good as Google.)

Thanks for the detailed response. I appreciate it. I will look into CommonCrawl. Cheers.

Applications are open for YC Winter 2020

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact