The Weaknesses of Full Text Searching (2008) [pdf] (rutgers.edu)
29 points by lemonspat 3 months ago | 7 comments

As a (junior) patent examiner, the weaknesses of text search were discussed in my training and have become very clear over time. Many people today think that text search is the "be-all and end-all" of search, but if one wants to be comprehensive, text search should only be one part of a search strategy. Other major components include citation search (forwards and backwards) and classification search.

Google can identify many synonyms today, but in my experience it frequently misses important ones. I've started compiling lists of synonyms and even partial search queries (medical searches call these "hedges") to use when searching. The problem of synonyms is one place where citation and classification search shine, as they are independent of the terminology used (and even language independent in the case of a classification like the IPC). There's no one "best" approach; these approaches complement each other. And you can do a "combination" search, e.g., of all the documents citing a given document, return those that contain a keyword.
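To make the "combination" search concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the document IDs, texts, citation map, and the small synonym list standing in for a "hedge" are made up, and the matching is naive substring matching rather than anything a real patent search tool does.

```python
# Hypothetical toy corpus: document IDs, texts, and citations are all made up.
TEXTS = {
    "D1": "A threaded fastener for joining panels.",
    "D2": "An improved screw with a self-locking thread.",
    "D3": "A bolt assembly for aircraft panels.",
    "D4": "A method of brewing coffee.",
}

# CITES[doc] is the set of documents that doc cites (its backward citations).
CITES = {
    "D1": set(),
    "D2": {"D1"},
    "D3": {"D1"},
    "D4": {"D1"},
}

# A small "hedge": terms treated as synonyms for this particular search.
FASTENER_HEDGE = ["fastener", "screw", "bolt"]


def forward_citations(target, cites):
    """Return the set of documents that cite `target`."""
    return {doc for doc, cited in cites.items() if target in cited}


def combination_search(target, terms, cites, texts):
    """Of the documents citing `target`, keep those containing any term."""
    lowered = [term.lower() for term in terms]
    hits = []
    for doc in sorted(forward_citations(target, cites)):
        text = texts[doc].lower()
        # Naive substring match; a real system would tokenize and stem.
        if any(term in text for term in lowered):
            hits.append(doc)
    return hits


if __name__ == "__main__":
    print(combination_search("D1", FASTENER_HEDGE, CITES, TEXTS))
```

Note that D4 cites D1 but is dropped by the keyword filter, while D2 and D3 are found even though neither uses the word "fastener". The citation step narrows the candidates independently of terminology, and the hedge covers the synonyms within that smaller set.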

Unfortunately classification search has fallen out of favor among the general population, but I can see systems like the Dewey Decimal System being extremely useful when the terminology in a field varies appreciably. Classification search is extremely useful in my work.

When I have the time I'll take a close look at this article. Thanks for posting it.

At least w.r.t. the synonym problem, more modern search techniques that rely on language modeling and word representations seem to solve it.

Is there an Elasticsearch-like server that uses language models? Or does it require a sophisticated team to tune and host?

There are a few things in this space that might be interesting:

- https://github.com/Hironsan/bertsearch

- https://github.com/hanxiao/bert-as-service

There was another I can't find right now that looked more polished/professional. But, in short, no, it's pretty easy to set up. You just need a machine with a pretty big disk, and you need to be OK with some indexing latency. If you're using Elasticsearch then you're already there on both counts!

Do you have some recent examples?

The most recent understandable implementation of this is something called Word2Vec. It is not the state of the art (it's actually suboptimal now by a lot), but there are a LOT of great explanations of it that you can find.

A decent video covering it from a programmer's perspective (where I'm coming from): https://www.youtube.com/watch?v=LSS_bos_TPI

From a math perspective there are also good materials.

Essentially, Apple + Tree != Apple + Computers, which lets you differentiate searches like these.
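A toy sketch of that idea in Python, using only the standard library. The three-dimensional vectors below are hand-made assumptions chosen to make the point (roughly "fruit-ness", "tech-ness", "plant-ness"); they are not real Word2Vec output, and the extra words "orchard" and "laptop" are hypothetical candidates added for illustration.

```python
import math

# Hand-made toy vectors (NOT trained embeddings). Dimensions are roughly
# (fruit-ness, tech-ness, plant-ness), picked only for illustration.
VECTORS = {
    "apple":     (0.6, 0.6, 0.1),
    "tree":      (0.2, 0.0, 0.9),
    "computers": (0.0, 0.9, 0.0),
    "orchard":   (0.5, 0.0, 0.8),
    "laptop":    (0.0, 1.0, 0.0),
}


def add(u, v):
    """Element-wise sum of two vectors."""
    return tuple(a + b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query_words, vectors):
    """Sum the query word vectors and return the most similar other word."""
    query = vectors[query_words[0]]
    for word in query_words[1:]:
        query = add(query, vectors[word])
    candidates = [w for w in vectors if w not in query_words]
    return max(candidates, key=lambda w: cosine(query, vectors[w]))


if __name__ == "__main__":
    print(nearest(["apple", "tree"], VECTORS))
    print(nearest(["apple", "computers"], VECTORS))
```

With these made-up numbers, "apple" + "tree" lands closest to "orchard" and "apple" + "computers" lands closest to "laptop", so the same ambiguous word is pulled in different directions by its context. Real embeddings have hundreds of dimensions and are learned from text rather than written by hand.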

In theory there is amazing progress in retrieval, but in practice we have Google. Maybe their motives are not aligned with search improvement after all.

The problem of meaning disambiguation has been solved with neural nets to a much higher degree than it appears in Google's search engine.
