No, I actually count the n-grams as distinct words (up to 4-grams). The main lim... | Hacker News


marginalia_nu on Sept 21, 2021 | on: A search engine that favors text-heavy sites and p...

No, I actually count the n-grams as distinct words (up to 4-grams). The main limiter for that is space, so I only extract "canned" n-grams from some tags.

I would first search for the bigram hello_world, which is an O(1) array lookup, and then for documents merely containing the words hello and world (usually not a good search result); that's the algorithm I'm describing in the parent comment.
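
A minimal sketch of that two-step lookup, to make the order of operations concrete. Everything here is assumed for illustration: the toy `INDEX`, the `search` and `intersect` names, and the document ids are hypothetical, and a Python dict stands in for the array lookup mentioned above. The fallback is a plain sorted-list intersection, which may differ from the algorithm in the parent comment.

```python
from typing import Dict, List

# Hypothetical toy index: every term, whether a single word or an n-gram
# joined with "_", maps to a sorted list of document ids.
INDEX: Dict[str, List[int]] = {
    "hello": [1, 2, 3, 5],
    "world": [2, 3, 4, 5],
    "hello_world": [3],
}


def intersect(a: List[int], b: List[int]) -> List[int]:
    """Intersect two sorted id lists with a linear merge."""
    i = j = 0
    out: List[int] = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def search(query: str) -> List[int]:
    """Return n-gram matches first, then documents containing every word."""
    words = query.lower().split()
    if not words:
        return []

    results: List[int] = []
    # Step 1: treat the whole query as one term (n-grams up to 4 words).
    if 2 <= len(words) <= 4:
        results = list(INDEX.get("_".join(words), []))

    # Step 2: fall back to documents that merely contain all the words.
    common = INDEX.get(words[0], [])
    for word in words[1:]:
        common = intersect(common, INDEX.get(word, []))

    seen = set(results)
    results.extend(doc for doc in common if doc not in seen)
    return results


print(search("hello world"))
```

In this toy data the document indexed under hello_world comes first, followed by the documents that only contain both words separately.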

soheil on Sept 21, 2021

Makes sense. Every time you insert a new URL for a word you have to update the ranges for every other word since the URL file will be shifted?
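
To make the question concrete, here is a toy sketch of the shifting problem. It assumes a layout where all URL ids sit in one flat list and each word owns a contiguous (start, end) range in it; the names `urls`, `ranges`, and `insert_url` are hypothetical, and this says nothing about how the actual engine stores or updates its index.

```python
from typing import Dict, List, Tuple

# Assumed layout: one flat "URL file" plus a half-open range per word.
urls: List[int] = [10, 11, 20, 21, 22]
ranges: Dict[str, Tuple[int, int]] = {
    "hello": (0, 2),  # urls[0:2]
    "world": (2, 5),  # urls[2:5]
}


def insert_url(word: str, url_id: int) -> None:
    """Append url_id to the end of word's range and fix up later ranges."""
    start, end = ranges[word]
    urls.insert(end, url_id)  # O(n): everything after `end` moves by one
    ranges[word] = (start, end + 1)

    # Every range stored at or after the insertion point shifts by one.
    for other, (s, e) in list(ranges.items()):
        if other != word and s >= end:
            ranges[other] = (s + 1, e + 1)


insert_url("hello", 12)
print(urls)
print(ranges)
```

Under this assumed layout, only the words stored after the insertion point need their ranges updated, but in the worst case that is still almost all of them.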
