Interesting, I would have thought that crawling at this scale and finishing in a reasonable amount of time would still be somewhat challenging. Might you have any suggested reading on how this is done in practice?
>"It costs about $50 to grep over it, $800 or so to run a moderately complex Hadoop job."
Curious what type of Hadoop job you might be referring to here. Would this be building smaller, more specific indexes, or simply sharding a master index?
>"Google hasn't used PageRank since 2006."
Wow, that's a long time now. What did they replace it with? Might you have any links regarding this?
$50 basically gets you a Hadoop job that can run a regular expression over the plain text in a reasonably efficient programming language (I tested with both Kotlin and Rust, and they were in that ballpark). $800 was for a custom MapReduce I wrote that did something moderately complex - it would look at an arbitrary website, determine if it was a forum page, and then develop a strategy for extracting parsed & dated posts from the page and crawling it in the future.
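To make the "grep" end concrete, the shape of that job is a map-only pass with the regex compiled once per mapper. Something like this Kotlin sketch against the stock Hadoop MapReduce API - the class name and the "grep.pattern" config key are just illustrative, not the exact code I ran:

    import org.apache.hadoop.fs.Path
    import org.apache.hadoop.io.LongWritable
    import org.apache.hadoop.io.NullWritable
    import org.apache.hadoop.io.Text
    import org.apache.hadoop.mapreduce.Job
    import org.apache.hadoop.mapreduce.Mapper
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat

    // Map-only "grep": emit every input line that matches the pattern.
    class GrepMapper : Mapper<LongWritable, Text, Text, NullWritable>() {
        private lateinit var pattern: Regex

        override fun setup(context: Context) {
            // Compile the regex once per mapper, not once per record.
            pattern = Regex(context.configuration.get("grep.pattern"))
        }

        override fun map(key: LongWritable, value: Text, context: Context) {
            val line = value.toString()
            if (pattern.containsMatchIn(line)) {
                context.write(Text(line), NullWritable.get())
            }
        }
    }

    fun main(args: Array<String>) {
        val job = Job.getInstance()
        job.setJarByClass(GrepMapper::class.java)
        job.configuration.set("grep.pattern", args[0])
        job.setMapperClass(GrepMapper::class.java)
        // Zero reducers means no shuffle/sort phase, which is most of why it's cheap.
        job.setNumReduceTasks(0)
        job.setOutputKeyClass(Text::class.java)
        job.setOutputValueClass(NullWritable::class.java)
        FileInputFormat.addInputPath(job, Path(args[1]))
        FileOutputFormat.setOutputPath(job, Path(args[2]))
        job.waitForCompletion(true)
    }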
A straight inverted index (where you tokenize the plaintext and store a posting list of documents for each term) would likely be more towards the $50 end of the spectrum - this is a classic information-retrieval exercise that's both pretty easy to program (you can do it in a half day or so) and not very computationally intensive. It's also pretty useless for a real consumer search engine - there's a reason Google displaced all the keyword-based search engines we used in the 90s. There's also no reason you would do it today, when you have open-source products like Elasticsearch that'd do it for you and have a lot more linguistic smarts built in. (Straight Elasticsearch with no ranking tweaks is also nowhere near as good as Google.)
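For reference, the toy in-memory version of that half-day exercise looks something like this - doc IDs and the tokenizer are arbitrary choices here, and a real one would write sorted posting lists to disk:

    // Build term -> sorted list of doc IDs from a map of doc ID -> plaintext.
    fun buildInvertedIndex(docs: Map<Int, String>): Map<String, List<Int>> {
        val postings = mutableMapOf<String, MutableSet<Int>>()
        for ((docId, text) in docs) {
            // Crude tokenizer: lowercase, split on anything that isn't a letter.
            for (term in text.lowercase().split(Regex("[^a-z]+"))) {
                if (term.isNotEmpty()) {
                    postings.getOrPut(term) { mutableSetOf() }.add(docId)
                }
            }
        }
        // Sorted posting lists make AND queries a linear merge of two lists.
        return postings.mapValues { (_, ids) -> ids.sorted() }
    }

    fun main() {
        val index = buildInvertedIndex(mapOf(
            1 to "the quick brown fox",
            2 to "the lazy dog",
            3 to "quick brown dogs"
        ))
        println(index["quick"]) // [1, 3]
        println(index["the"])   // [1, 2]
    }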