API access to either the unranked or ranked index in memory wouldn't do anything useful, BTW. To have a viable startup you need something a lot better than Google, which means that you need algorithms that do something fundamentally different from Google, which means you need to be able to touch memory yourself and not go through an API for every document you might need to examine. Remember, search touches (nearly) every indexed document on every query - if you throw in 200ms request latency for 4B documents your request will take roughly 25 years to complete.
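For anyone who wants to check the back-of-envelope number, here is a minimal sketch of the arithmetic. It assumes one strictly sequential API call per document with no batching or parallelism; the function name is made up for illustration.

```python
# Rough estimate: one API round trip per document, done one after another.
# Assumption: no batching, no parallelism, constant latency per request.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def sequential_api_years(num_docs: int, latency_s: float) -> float:
    """Total wall-clock time in years for num_docs sequential requests."""
    return num_docs * latency_s / SECONDS_PER_YEAR


years = sequential_api_years(4_000_000_000, 0.200)
print(f"about {years:.0f} years")
```

Parallelism would shrink the wall-clock time, but the total work stays the same, which is why direct memory access matters here.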
Knowledge Graph is already public - it was an open dataset before it was bought by Google, and a snapshot of its state at the point Google closed it to further additions is still hosted by Google:
(It's only 22G gzipped, too - you can download that onto a personal laptop.)
Doesn't it only touch documents that contain at least one of the search terms, or stemmed/varied forms of those terms? And doesn't it do that via an index?
Basically, yes, it uses an index and touches only documents that appear in one of the relevant posting lists. However, after stemming, spell-correcting, synonyms, and a number of other expansions I'm not at liberty to discuss, there can be a lot of query terms that it needs to look through, covering a significant portion of the index. Each one of these needs to be scored (well, sorta - there are various tricks you can use to avoid scoring some docs, which again I'm not at liberty to discuss), and it's usually beneficial to merge the scores only after they have been computed for all query terms, because you have more information about context available then.
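To make the general shape concrete, here is a toy sketch of posting lists, query expansion, and scoring that happens only after hits for all terms have been gathered. It is not how Google does it: the index layout, the `expansions` table, and the scoring formula are all invented for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Toy inverted index: term -> {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, query_terms, expansions):
    """Walk the posting lists of every expanded term, then score each doc."""
    # Gather per-document hits first; only docs that appear in some
    # relevant posting list are ever touched.
    hits = defaultdict(dict)
    for term in query_terms:
        for variant in [term] + expansions.get(term, []):
            for doc_id, tf in index.get(variant, {}).items():
                hits[doc_id][variant] = tf

    # Score only after all terms are collected, so the score can depend on
    # the whole query (here: how many original terms the doc covers).
    scores = {}
    for doc_id, term_hits in hits.items():
        coverage = sum(
            1
            for term in query_terms
            if any(v in term_hits for v in [term] + expansions.get(term, []))
        )
        scores[doc_id] = sum(term_hits.values()) * coverage
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


docs = {
    1: "cheap flights to paris",
    2: "paris hotels",
    3: "inexpensive flight deals",
    4: "gardening tips",
}
# Hypothetical expansion table standing in for stemming and synonyms.
expansions = {"cheap": ["inexpensive"], "flights": ["flight"]}
results = search(build_index(docs), ["cheap", "flights"], expansions)
print(results)
```

Documents 2 and 4 never get scored because they sit in none of the expanded posting lists, but every extra expansion widens the set that does.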
There's a reason Google uses an in-memory index: it gives you a lot more flexibility about what information you can use to score documents at query time, which in turn lets you use more of the query as context. With an on-disk index you basically have to precompute scores for each term and can only merge them with simple arithmetic formulas.
But, reading through the other comments, leaving out this part would make it better than Google.
Maybe stemming. I remember when Google added stemming (somewhere in the early 2000s). I was conflicted about it because I didn't want a search engine to second-guess my query (can you imagine??), but I also saw the use because I was already in the habit of trying multiple variations.
Automatic spelling correction is a no-no. Just say "did you mean X?" and let people click it if they misspelled X. There's no sense in querying for both the "typo" and "corrected" keywords, because the "typo" would rank much lower, right?
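A rough sketch of what I mean, using Python's standard `difflib` for the fuzzy match. The vocabulary list and the `did_you_mean` helper are made up for illustration; a real engine would draw suggestions from its own term dictionary or query logs.

```python
import difflib


def did_you_mean(term, vocabulary, cutoff=0.7):
    """Return a suggested spelling, or None if the term is already known.

    The query itself is never rewritten; the caller just shows the
    suggestion as a clickable hint.
    """
    if term in vocabulary:
        return None
    close = difflib.get_close_matches(term, vocabulary, n=1, cutoff=cutoff)
    return close[0] if close else None


# Hypothetical vocabulary of indexed terms.
vocabulary = ["search", "engine", "index", "query"]
suggestion = did_you_mean("serach", vocabulary)
if suggestion is not None:
    print(f'Did you mean "{suggestion}"?')
```

The search still runs on exactly what the user typed, and the suggestion is only an offer.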
Similar for synonyms. Either it should be an operator like ~, or maybe it should just offer a list (like the "did you mean" question) of synonyms to help the user think/select similar words to help their query.
You mean like WAND or BMW (Block-Max WAND)?
That dump is outdated, unsupported, and very incomplete compared to what Google has now.