

Similarity search engine for Wikipedia - Ixiaus
http://www.smartwikisearch.com/

======
frig
There's a straightforward intuitive interpretation of the 1st
eigenvector (for the eigenvalue == 1) of the pagerank matrix (loosely
speaking: assuming you browse a-la the pagerank matrix, the i-th component of
the (unit) eigenvector for the 1.0 eigenvalue is ~the same as lim n->infinity
of #(of visits to i-th page on the internet)/n; some details are getting
blurred (of course) but it's a helpful intuitive explanation).
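To make that intuition concrete, here's a toy sketch (my own, not the linked site's code) for a made-up 3-page link graph: power iteration recovers the eigenvector for eigenvalue 1, and a simulated random surfer's visit frequencies converge to the same vector.

```python
import random

# Column-stochastic transition matrix P for a toy 3-page web:
# P[i][j] = probability of moving from page j to page i.
P = [[0.0, 0.5, 0.5],
     [0.5, 0.0, 0.5],
     [0.5, 0.5, 0.0]]

def step(v):
    """One application of P to a distribution vector v."""
    return [sum(P[i][j] * v[j] for j in range(3)) for i in range(3)]

# Power iteration converges to the eigenvector for eigenvalue 1.
v = [1.0, 0.0, 0.0]
for _ in range(100):
    v = step(v)

# Simulate the random surfer and count visit frequencies.
random.seed(0)
counts = [0, 0, 0]
page = 0
for _ in range(100_000):
    counts[page] += 1
    r, acc = random.random(), 0.0
    for nxt in range(3):
        acc += P[nxt][page]
        if r < acc:
            page = nxt
            break

freqs = [c / 100_000 for c in counts]
# v and freqs should both be close to [1/3, 1/3, 1/3] for this
# symmetric toy graph.
```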

Does anyone have any insight into intuitive interpretations of the
eigenvectors for the eigenvalues < 1? I'm drawing blanks.

(Edited: to be clear, I understand what the linked algorithm is doing and I
think it is very clever, and I am not surprised that it works reasonably well
for finding similar topics on wikipedia. I'm just at a blank for what it means
to say that 'the i-th component of the (unit) eigenvector for the k-th largest
eigenvalue is X'.)

~~~
diiq
I would recommend looking at 'Proto Value Functions' (
[http://www.machinelearning.org/.../070_ProtoValue_Mahadevan....](http://www.machinelearning.org/.../070_ProtoValue_Mahadevan.pdf)
)[1]; they use eigenvectors of adjacency matrices for reinforcement learning
--- but because gridworld has a simple adjacency matrix, the resulting
eigenvectors are somewhat human-readable, and the pictures help a lot. If
you're really curious, I can probably bang up some python or MATLAB or
something.

The essential intuition is that higher-numbered eigenvectors represent more
_local_ adjacency information --- so the first eigenvector deals with global
connectivity (the relative value of the i-th and j-th components says
something about how interconnected nodes i and j are in terms of the whole
graph). A higher-numbered eigenvalue's corresponding vector has a higher
local gradient, and the direction of the gradient is less constant across the
graph (there is not necessarily a monotonic path from node to node), so it
says less about global connectivity --- but has a higher granularity when
examining local connectivity.
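Here's a minimal NumPy sketch of that point, assuming a 1-D path graph (gridworld's simplest cousin): the top eigenvector of the adjacency matrix is one smooth global "bump" with no sign changes, and each successive eigenvector oscillates one more time, i.e. it encodes progressively more local structure.

```python
import numpy as np

n = 10  # path graph on 10 nodes (toy example, not from the paper)
A = np.zeros((n, n))
for i in range(n - 1):           # node i linked to node i+1
    A[i, i + 1] = A[i + 1, i] = 1.0

vals, vecs = np.linalg.eigh(A)   # eigh returns ascending eigenvalues
order = np.argsort(vals)[::-1]   # reorder: largest eigenvalue first

def sign_changes(v):
    """Count sign flips along the path --- a crude 'locality' measure."""
    s = np.sign(v)
    return int(np.sum(s[:-1] * s[1:] < 0))

# changes[k] = number of oscillations of the (k+1)-th eigenvector:
# the first has 0 (smooth/global), each later one adds exactly one.
changes = [sign_changes(vecs[:, k]) for k in order]
```

For a path graph the eigenvectors are sampled sinusoids, so `changes` comes out as `[0, 1, 2, ..., 9]` --- the pictures in Mahadevan's paper are the 2-D gridworld version of exactly this.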

1[Edited to warn that Mahadevan's papers are fantastically dense --- and you
don't need to understand the paper to gain intuition from the pretty pictures;
read the text at your own risk]

~~~
frig
Thank you for the informed response.

I think you meant this link:

[http://www.machinelearning.org/proceedings/icml2005/papers/0...](http://www.machinelearning.org/proceedings/icml2005/papers/070_ProtoValue_Mahadevan.pdf)

...reading now.

~~~
diiq
Yes. I apologize.

------
wheels
I feel like ours does a bit better. Here are the sample searches they suggest:

<http://pedia.directededge.com/article/PHP>

<http://pedia.directededge.com/article/Flower>

<http://pedia.directededge.com/article/Bee>

<http://pedia.directededge.com/article/Albert_Einstein>

~~~
bsaunder
How so?

I like their intersections (e.g. Abraham Lincoln, Robert Lee) and simple page
layout.

~~~
wheels
The intersections are interesting, but I feel like our engine does better at
sussing out what something really is -- e.g. that PHP is a programming
language, that Albert Einstein is a physicist.

~~~
bd
_"sussing out what something really is"_

Wouldn't this be easier done by simply scraping "Category" links on the page?
That's free human curated metadata right there.
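A toy sketch of that idea (my own, not wheels' code), using only the stdlib HTML parser: on real Wikipedia pages the category links are anchors whose hrefs start with `/wiki/Category:`, and the sample HTML below is made up for illustration.

```python
from html.parser import HTMLParser

class CategoryLinkParser(HTMLParser):
    """Collect the names of /wiki/Category: links in a page."""

    def __init__(self):
        super().__init__()
        self.categories = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        prefix = "/wiki/Category:"
        if href.startswith(prefix):
            self.categories.append(href[len(prefix):])

# Made-up snippet standing in for a fetched article page.
sample = (
    '<a href="/wiki/Physics">Physics</a>'
    '<a href="/wiki/Category:Physicists">Physicists</a>'
    '<a href="/wiki/Category:German_emigrants">German emigrants</a>'
)
parser = CategoryLinkParser()
parser.feed(sample)
# parser.categories now holds the curated category names.
```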

~~~
wheels
We actually do that for generating tags, but those don't play into the
recommendations that are generated.

------
jsrn
For the algorithm that was used, see:

<http://www.smartwikisearch.com/algorithm.html>

