

What domain of knowledge does my problem fall under? - voidfiles

My business partner, and I have created a site; http://www.qwertykitchen.com . Its a recipe site. Our main form of search is entering ingredients that you have on hand and you get a list of recipes that contain those ingredients.<p>We are now looking at doing something more then just full text searching, and keywords. The first problem we want to attack is making words that are basically the same, show up in the search when the other is called.<p>For example, we have egg and eggs, or milk, 2% milk, whole milk, and skim milk. We want someone to only have to enter milk. If they do enter milk, then the entire cluster should be searched.<p>My idea of an approach to this problem is this. We search for longest common substring. Then create a directed graph representation. Then when a person searchs for a term, we use a proximity measure and then search for all the terms with in a certain proximity.<p>So, I was wondering one, does anyone know if their is a name for what I am talking about, and two, does anyone maybe have a better idea of how to do something like this.
======
jdale27
The domain of knowledge you're interested in is "information retrieval" (IR).

Here's a textbook on the subject from some of the top minds in the field:
[http://www-csli.stanford.edu/~hinrich/information-retrieval-...](http://www-
csli.stanford.edu/~hinrich/information-retrieval-book.html)

I am not an IR expert, but one particular algorithm that you might find useful
is latent semantic analysis, a.k.a. latent semantic indexing:
<http://en.wikipedia.org/wiki/Latent_semantic_analysis>

------
Someone
> We search for longest common substring.

That idea can be improved upon: "length of largest common substring" is not
the measure to use. At the least, you want to use Levenshtein distance
<[http://en.wikipedia.org/wiki/Levenshtein_distance>](http://en.wikipedia.org/wiki/Levenshtein_distance>),
but even that can be tweaked to weigh common typing 'errors' lower (for
example, colour is closer to color than to clour)

> Then when a person searchs for a term, we use a proximity measure and then
> search > for all the terms with in a certain proximity

This is a form of
<[http://en.wikipedia.org/wiki/Query_expansion>](http://en.wikipedia.org/wiki/Query_expansion>);

------
michael_dorfman
What you are asking about is fuzzy searching, but the fact that you are asking
at all indicates that you ought to read up on searching theory. There's a
large body of work out there that you ought to familiarize yourself with.

~~~
voidfiles
do you have specific suggestions on what to read?

~~~
michael_dorfman
Recent issues of the ACM Transactions on Information Systems will bring you
the latest research; alternately, if you want to start at the beginning, Knuth
volume 3.

------
mooneater
Eggs and egg is an example of stemming, see
<http://en.wikipedia.org/wiki/Stemming>

The milk example as stated could be solved using fulltext indexing and
searching.

I think your approach might work, but you could get results much more quickly
using existing database fulltext search.

~~~
voidfiles
I am not convinced that just fulltext search would work. Wouldn't fulltext
search fail in the case of goat. You have goats milk, goats milk cheese, goat
meat. So if someone searched goat all three types would be returned?

~~~
technoguyrob
Well, if someone claims to have a goat, they probably have all three of those,
or can at least make them. ;-)

~~~
akd
not if it's a boy goat

------
jauco
Try reading up on latent semantic analysis. It also uses vector calculus, but
works a bit different IIRC. It can be used to identify words that are used in
the same context and therefore should be the same. I'm not sure if it is
directly applicable for recipes because of the concise language, but you might
find interesting related articles.

