
I struggled with how to word that in a way that's true, understandable, and doesn't give away any proprietary information. I added "indexed" to clarify, but I didn't fix up the numbers, so they're likely an overestimate.

Basically, yes, it uses an index and touches only documents that appear in one of the relevant posting lists. However, after stemming, spell-correcting, synonyms, and a number of other expansions I'm not at liberty to discuss, there can be a lot of query terms that it needs to look through, covering a significant portion of the index. Each one of these needs to be scored (well, sorta - there are various tricks you can use to avoid scoring some docs, which again I'm not at liberty to discuss), and it's usually beneficial to merge the scores only after they have been computed for all query terms, because you have more information about context available then.
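To make the shape of that concrete, here is a toy sketch in Python. It is not how any real engine does it: the index contents, the `EXPANSIONS` table, the discounts, and the coverage-based `merge` function are all made up for illustration. It only shows the two ideas above: documents are touched only if they appear in a posting list for some expanded term, and per-term scores are kept separate until the end so the merge step can see which query terms each document matched.

```python
from collections import defaultdict

# Hypothetical toy index: index term -> {doc_id: term weight}.
INDEX = {
    "run": {1: 1.0, 2: 0.5},
    "jog": {2: 0.8, 3: 0.6},
    "shoe": {1: 0.7, 3: 0.9, 4: 0.4},
    "hat": {5: 1.0},  # never touched by the example query
}

# Hypothetical expansion table: query term -> [(index term, discount)].
# Stands in for stemming, synonyms, spelling correction, and so on.
EXPANSIONS = {
    "running": [("run", 1.0), ("jog", 0.5)],
    "shoes": [("shoe", 1.0)],
}


def search(query):
    query_terms = set(query.split())
    # doc_id -> {query term: best score among that term's expansions}
    per_doc = defaultdict(dict)
    for q in query_terms:
        for term, discount in EXPANSIONS.get(q, [(q, 1.0)]):
            # Only documents in this posting list are ever visited.
            for doc, weight in INDEX.get(term, {}).items():
                score = weight * discount
                per_doc[doc][q] = max(per_doc[doc].get(q, 0.0), score)

    def merge(term_scores):
        # Merging happens last, so it can use context: here, what
        # fraction of the original query terms the document matched.
        coverage = len(term_scores) / len(query_terms)
        return sum(term_scores.values()) * coverage

    merged = [(doc, merge(scores)) for doc, scores in per_doc.items()]
    return sorted(merged, key=lambda item: -item[1])


results = search("running shoes")
```

With this toy data, `search("running shoes")` ranks the documents 1, 3, 2, 4, and document 5 is never visited because it is in none of the relevant posting lists.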

There's a reason Google uses an in-memory index: it gives you a lot more flexibility about what information you can use to score documents at query time, which in turn lets you use more of the query as context. With an on-disk index you basically have to precompute scores for each term and can only merge them with simple arithmetic formulas.




> Basically, yes, it uses an index and touches only documents that appear in one of the relevant posting lists. However, after stemming, spell-correcting, synonyms, and a number of other expansions I'm not at liberty to discuss, there can be a lot of query terms that it needs to look through, covering a significant portion of the index.

But, reading through the other comments, leaving out this part would make it better than Google.

Maybe stemming. I remember when Google added stemming (sometime in the early 2000s). I was conflicted about it because I didn't want a search engine to second-guess my query (can you imagine??), but I also saw the use because I was already in the habit of trying multiple variations.

Automatic spelling correction is a no-no. Just say "did you mean X?" and let people click it if they misspelled X. No sense in querying for both the "typo" and "corrected" keywords, because the "typo" would rank much lower, right?

Similarly for synonyms. Either it should be an operator like ~, or maybe it should just offer a list of synonyms (like the "did you mean" question) to help the user think of or select similar words for their query.


> Each one of these needs to be scored (well, sorta - there are various tricks you can use to avoid scoring some docs, which again I'm not at liberty to discuss)

You mean like WAND or BMW (Block-Max WAND)?
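For anyone who hasn't seen it, WAND is a published top-k pruning technique, and the sketch below is only my simplified reading of it, with no claim that the parent's engine works this way. Each term carries an upper bound on its score contribution; a document is fully scored only if the bounds of the terms that could contain it add up to more than the current k-th best score. The `POSTINGS` data and the `wand_top_k` name are made up for illustration, and scores are assumed to be simple per-term sums.

```python
import heapq

# Hypothetical postings: term -> [(doc_id, score contribution)], sorted by doc_id.
POSTINGS = {
    "search": [(1, 0.5), (2, 3.0), (5, 0.4), (8, 0.3)],
    "engine": [(2, 2.0), (3, 0.2), (5, 0.1), (9, 0.2)],
    "index": [(4, 0.1), (6, 0.2), (7, 0.1)],
}


def wand_top_k(postings, k):
    """Return (top-k [(doc_id, score)], number of docs fully scored). Assumes k >= 1."""
    lists = [plist for plist in postings.values() if plist]
    upper = [max(score for _, score in plist) for plist in lists]  # per-term bounds
    pos = [0] * len(lists)  # one cursor per posting list
    heap = []  # min-heap of (score, doc_id) holding the current top k
    scored = 0

    def current_doc(i):
        return lists[i][pos[i]][0]

    while True:
        # Unexhausted lists, ordered by the doc their cursor points at.
        live = sorted(
            (i for i in range(len(lists)) if pos[i] < len(lists[i])),
            key=current_doc,
        )
        theta = heap[0][0] if len(heap) == k else float("-inf")

        # Pivot: first doc at which the accumulated bounds can beat theta.
        acc, pivot_doc = 0.0, None
        for i in live:
            acc += upper[i]
            if acc > theta:
                pivot_doc = current_doc(i)
                break
        if pivot_doc is None:
            break  # no remaining doc can enter the top k

        if current_doc(live[0]) == pivot_doc:
            # Every list up to the pivot sits on pivot_doc: score it fully.
            total = 0.0
            for i in live:
                if current_doc(i) != pivot_doc:
                    break
                total += lists[i][pos[i]][1]
                pos[i] += 1
            scored += 1
            if len(heap) < k:
                heapq.heappush(heap, (total, pivot_doc))
            elif total > heap[0][0]:
                heapq.heapreplace(heap, (total, pivot_doc))
        else:
            # Docs before pivot_doc cannot beat theta: skip the lagging list ahead.
            i = live[0]
            while pos[i] < len(lists[i]) and current_doc(i) < pivot_doc:
                pos[i] += 1

    top = sorted(((doc, score) for score, doc in heap), key=lambda item: -item[1])
    return top, scored


top, scored = wand_top_k(POSTINGS, 1)
```

On this toy data with k = 1, the best document (doc 2, score 5.0) is found after fully scoring only 2 of the 9 documents; the rest are skipped because their upper bounds cannot beat the threshold. Block-Max WAND refines this by keeping bounds per block of postings rather than one bound per term.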



