

A new experimental "similarity search" algorithm. Thoughts? - eserorg
http://eser.org

======
eserorg
Currently, we're text-mining the English version of Wikipedia.

There's a lot of room for improvement: optimizing for speed and pruning down
the results are at the top of our "TODO" list.

Also, the UI is simplistic -- that's because we've been spending 99% of our
time working on the algorithm in MATLAB.

But, we wanted to get something out -- warts and all -- to get some feedback
on the general idea.

We'd value any feedback -- positive or negative.

~~~
gaika
Pretty cool, what is the kind of math behind it? PLSA?

~~~
eserorg
Yes.

At this point we're really constrained by the number of cores we're running
on.

Once we can get a hold of some more servers, we should be able to drastically
improve the performance and prune many of the results.

We'd also like to run the algorithm on additional corpora. Specifically: (1)
the US patent database (back to 1975); and (2), a collection of United States
federal and state case law (the JURIS database).

------
uuilly
For what it's worth I searched for "google.com" and got "The International
Society for Cryptozoology" as a top hit. Not sure how useful it was but I did
learn that there was such a thing as Cryptozoology and my day instantly got 10%
better.

I actually use this one all the time: <http://www.similicio.us/>

I think it uses delicious tags.

------
andr
I tried Bach and it appears that it just returned Wikipedia articles that
mentioned Bach. My initial expectations were more along the lines of Google
Sets.

What are the shapes on the homepage for? It was kind of intriguing that they
seemed to do something, but I didn't know what.

~~~
eserorg
You have to click on the bold links to "drill-down". So, if you click on
"johann sebastian bach", it will start to give you results such as
"Vivaldi's Cello" and "History of Germany".

Currently, we're limited to running on one server. Therefore, the algorithm is
restricted to running on the English Wikipedia corpus.

It appears that the "ski slope" metaphor with the shapes was a bad idea.

Each shape is a ski-slope difficulty rating symbol. So, Green -- "easy" --
will take the meaning of your query more literally. Double black diamond --
"advanced expert" -- will try to extrapolate hidden meanings in your query. It
will suggest topics that are less obviously related to your query.
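
One way a dial like this could work is to blend a literal term-match score
with a latent-topic score, weighting the latent score more heavily at harder
levels. This is purely an assumption: the thread does not say how the site
implements the levels. The level names, the weights, and the example scores
below are all hypothetical.

```python
# Hypothetical weights: 0.0 means "take the query literally",
# 1.0 means "rely only on the latent-topic score".
LEVELS = {
    "green": 0.0,
    "blue": 0.33,
    "black": 0.67,
    "double_black": 1.0,
}


def blended_score(literal, latent, level):
    """Linear blend of a literal score and a latent score for one document."""
    weight = LEVELS[level]
    return (1.0 - weight) * literal + weight * latent


def rank(candidates, level):
    """Sort titles by blended score, best first.

    candidates maps a title to a (literal_score, latent_score) pair.
    """
    return sorted(
        candidates,
        key=lambda title: blended_score(*candidates[title], level),
        reverse=True,
    )


# Made-up scores for a query like "bach".
candidates = {
    "Johann Sebastian Bach": (0.9, 0.5),
    "History of Germany": (0.1, 0.8),
}
```

With these made-up numbers, `rank(candidates, "green")` puts the article that
literally matches the query first, while `rank(candidates, "double_black")`
favors the less obviously related one.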

~~~
andr
The difficulty rating thing is really, really clever! I wish Google had that.
Only problem is that I usually ski near home (in Europe) and the convention
over there is colors (green, blue, red and black) instead of shapes. Still,
kudos for that feature!

~~~
eserorg
Good information. I didn't know that. Also, not everyone skis.

I'm trying to think of a way to make the functionality of the shapes more
obvious.

------
okeumeni
Besides the fact that it thinks too long, what is new about it, what are you
trying to achieve?

I searched for "Test" and the results were not really relevant.

~~~
eserorg
Thank you for trying it out.

Re: performance
- It's mining through ~40 GB of data on a server with 8 GB of RAM.
- Also, we're not caching search results -- it computes on the fly for each
  query.
- If we can get hold of more servers, we should be able to bring the query
  time below 1 second.

Re: query "Test"
- You have to search for something you're interested in.

~~~
okeumeni
From my own experience, 40GB of data with 8GB of RAM is very good; it should
be enough for better performance. I don’t think you need more servers at this
time.

Think about it: for you guys to have a meaningful search engine, you will
need data in the TB range. How many servers will you need then? Spend more
time fine-tuning your search algorithm and processing, and you should get
better performance out of what you have now. Then your repository will grow
proportionally to your resources and you should be fine.

When I said the search for 'test' did not return good results, I meant you
should do more work on relevancy.

~~~
eserorg
It's a very tough problem. Queries such as "square", "blue", "fast", etc.
will yield very poor results.

PLSI (another name for PLSA) tends to perform very well on more specific
queries, such as "Paul Graham", "silicon graphics", etc.

The problem with PLSI is that it is extremely computationally expensive --
which is why most internet-scale search engines don't use it.

Our innovation was figuring out some tricks that have allowed us to improve
performance dramatically. However, there is obviously still room for
improvement.

Our goal is to satisfy 80% of the queries with decent results -- and to leave
the other 20% (square, etc...) to someone else.

The interesting thing about PLSI is that it's able to rank documents from the
text alone -- ignoring the link structure and other metadata.
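
To make "rank documents from the text alone" concrete, here is a minimal
sketch of textbook PLSA fitted by EM on a toy count matrix, followed by
ranking documents by how similar their topic mixtures are. It is not the
optimized algorithm discussed in this thread. The toy vocabulary, the counts,
the cosine ranking step, and all function names are assumptions made for
illustration. Note that every EM iteration visits each nonzero
document/word count once per topic, which hints at why this gets expensive on
a large corpus.

```python
import math
import random


def normalize(row, eps=1e-12):
    """Scale non-negative values to sum to 1; eps guards against all-zero rows."""
    total = sum(row) + eps * len(row)
    return [(x + eps) / total for x in row]


def plsa(counts, n_topics, n_iter=200, seed=0):
    """Fit PLSA by EM. Returns (p_z_d, p_w_z) with
    p_z_d[d][z] = P(topic z | doc d) and p_w_z[z][w] = P(word w | topic z)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z_d = [normalize([rng.random() for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() for _ in range(n_words)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        acc_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z | d, w) is proportional to P(z | d) * P(w | z).
                post = normalize([p_z_d[d][z] * p_w_z[z][w]
                                  for z in range(n_topics)])
                for z in range(n_topics):
                    acc_z_d[d][z] += n * post[z]
                    acc_w_z[z][w] += n * post[z]
        # M-step: renormalize the expected counts.
        p_z_d = [normalize(row) for row in acc_z_d]
        p_w_z = [normalize(row) for row in acc_w_z]
    return p_z_d, p_w_z


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar(p_z_d, doc_index):
    """Other documents sorted by similarity of topic mixture, best first."""
    scored = [(cosine(p_z_d[doc_index], row), i)
              for i, row in enumerate(p_z_d) if i != doc_index]
    return [i for _, i in sorted(scored, reverse=True)]


# Toy corpus: two music documents (0, 1) and two skiing documents (2, 3).
vocab = ["bach", "fugue", "organ", "ski", "slope", "snow"]
counts = [
    [4, 2, 2, 0, 0, 0],
    [2, 1, 1, 0, 0, 0],
    [0, 0, 0, 3, 3, 6],
    [0, 0, 0, 1, 1, 2],
]
p_z_d, p_w_z = plsa(counts, n_topics=2)
```

On this toy data, `most_similar(p_z_d, 0)` should list the other music
document ahead of the skiing ones, using nothing but word counts. No links or
other metadata are involved.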

Therefore, we're thinking our algorithm will make the most sense in situations
where there is lots of textual data without web-like link metadata.

The two scenarios that come to mind where people need to text-mine documents
outside the metadata-rich web are: (1) Windows file shares on corporate
intranets; and (2) large volumes of legal documents inside law firms.

Text-mining Wikipedia is a proof of concept at this point.

------
vzn
It's like Google with schizophrenia. Pretty impressive. It seems like drugs
and alcohol as mind enhancers for creative people (writers, designers,
painters) are obsolete now :-)

