

Approximate search for LaTeX formulae - jamii
http://scattered-thoughts.net/one/1291/799313/731344

======
timtadh
His distance-based nearest neighbor search algorithm is good, but as I noted
in an email to him, in the worst case it performs O(|T|) distance
computations, where |T| is the size of his text corpus. While this is fine,
you can do better if you use distance-based indexing methods. Such methods
can find the nearest neighbor in O(log_{k}(|T|)) distance computations, where
k depends on the index structure used. For the simplest structures, such as a
vp-tree or gh-tree, k is 2. However, for more advanced structures such as
Bozkaya and M. Ozsoyoglu's MVP Tree or Sergey Brin's (yes, that Brin) GNAT, k
can be much higher.
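
To make the k = 2 case concrete, here is a minimal vp-tree sketch in Python. It is only an illustration under assumptions: the names (`VPNode`, `build`, `nearest`) are made up for this example, Levenshtein distance stands in for whatever metric the search actually uses, and the pruning only works because that distance obeys the triangle inequality.

```python
import random

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class VPNode:
    """One vantage point plus the median radius that splits the rest."""
    def __init__(self, point, radius, inside, outside):
        self.point, self.radius = point, radius
        self.inside, self.outside = inside, outside

def build(points, dist, rng=None):
    rng = rng or random.Random(0)
    points = list(points)
    if not points:
        return None
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0, None, None)
    ds = sorted(((dist(vp, p), p) for p in points), key=lambda t: t[0])
    median = ds[len(ds) // 2][0]
    # If most distances are equal, almost everything lands on one side
    # and the tree degenerates towards a list.
    inside = [p for d, p in ds if d < median]
    outside = [p for d, p in ds if d >= median]
    return VPNode(vp, median, build(inside, dist, rng), build(outside, dist, rng))

def nearest(node, query, dist, best=None):
    """Return (distance, point) for the nearest indexed point to query."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    if d < node.radius:
        best = nearest(node.inside, query, dist, best)
        # Only cross the boundary if a closer point could lie beyond it.
        if d + best[0] >= node.radius:
            best = nearest(node.outside, query, dist, best)
    else:
        best = nearest(node.outside, query, dist, best)
        if d - best[0] < node.radius:
            best = nearest(node.inside, query, dist, best)
    return best

corpus = ["x^2+y^2", "a^2+b^2=c^2", "e^{i\\pi}+1=0", "\\frac{a}{b}",
          "x^2+y^2=z^2", "\\sum_{i=0}^n i"]
tree = build(corpus, levenshtein)
print(nearest(tree, "x^2+y^2=r^2", levenshtein))
```

The logarithmic bound only holds when the median split is roughly balanced and the pruning tests actually fire; in the worst case the search still visits every node.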

~~~
jamii
Bear in mind that I have not one large corpus but ~8m short formulae. The
filtering stage removes most of these in O(L log N + KL) time, where K is the
number of results which match one or more fragments, N is the number of
formulae and L is the maximum formula length. In practice K is about 3x the
actual result set. So yes, in the worst case this is O(T) but in practice the
performance is very good.
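
As a rough illustration of the filter-then-verify shape described above, here is a small Python sketch. It is not the implementation from the article: character trigrams in a plain dict stand in for the fragment index, and the names `ngrams`, `build_index`, and `search` are hypothetical. Only formulae sharing at least one fragment with the query reach the expensive edit-distance step.

```python
from collections import defaultdict

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def ngrams(s, n=3):
    """Set of all length-n substrings (the 'fragments' in this sketch)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_index(formulae, n=3):
    """Map each fragment to the ids of the formulae that contain it."""
    index = defaultdict(set)
    for fid, formula in enumerate(formulae):
        for gram in ngrams(formula, n):
            index[gram].add(fid)
    return index

def search(query, formulae, index, max_dist, n=3):
    # Filter: keep only formulae sharing at least one fragment with the query.
    candidates = set()
    for gram in ngrams(query, n):
        candidates |= index.get(gram, set())
    # Verify: run the real distance on the (hopefully small) candidate set.
    scored = ((levenshtein(query, formulae[fid]), formulae[fid]) for fid in candidates)
    return sorted((d, f) for d, f in scored if d <= max_dist)

formulae = ["x^2+y^2=z^2", "a+b=c", "\\frac{x}{y}", "x^2+y^2=r^2"]
index = build_index(formulae)
print(search("x^2+y^2=z^2", formulae, index, max_dist=2))
```

A filter like this can miss a true match that shares no fragment with the query, so the fragment length trades recall against candidate-set size.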

I've used a variety of spatial index trees in previous versions but this
method improved performance by 80-100x on typical input. I think the main
problem is that the equation space has an incredibly high dimension and the
vast majority of pairs of formulae have no elements in common so most
comparisons give you little information.

~~~
timtadh
Good experimental results! The high dimensionality (or non-dimensionality) of
the space can definitely hurt the performance of spatial trees such as
R-Trees. For distance trees, however, what really kills you is the metric used
to compute the distance. There can be a couple of problems with it: 1) if it
takes a long time to compute, that will obviously hurt you; 2) if the variance
of the distances is low, or if some distances are meaningless, the tree will
be misshapen.

I was thinking as I was going over the article again that you could represent
your formulae as trees instead of strings. This might yield more
accurate/interesting distances. I recently used the PQ-gram distance[1] to
compare subsets of ASTs with great success. PQ-gram is a really fast
approximation of tree edit distance. I also tried using exact edit distance
with the Zhang-Shasha algorithm[2], but in general the PQ-gram was so much
faster that the greater accuracy of ZSS didn't matter.

[1] <http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.149.7933>
[2] <https://github.com/timtadh/zhang-shasha>
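
For a feel of how the PQ-gram distance works, here is a small Python sketch following my understanding of the definition in [1]: each tree is turned into a bag of label tuples (p ancestors plus q consecutive siblings, padded with `*`), and the distance is one minus twice the bag intersection over the bag union. The `Node` class and function names are made up for this example, and the defaults p=2, q=3 are just a common choice.

```python
from collections import Counter

class Node:
    """Hypothetical labelled ordered tree node for this sketch."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def pq_profile(root, p=2, q=3):
    """Bag of pq-grams: p ancestor labels followed by q sibling labels."""
    profile = Counter()

    def visit(node, ancestors):
        ancestors = ancestors[1:] + (node.label,)
        siblings = ("*",) * q
        if not node.children:
            profile[ancestors + siblings] += 1
            return
        for child in node.children:
            siblings = siblings[1:] + (child.label,)
            profile[ancestors + siblings] += 1
            visit(child, ancestors)
        # Slide the sibling window off the right edge with padding.
        for _ in range(q - 1):
            siblings = siblings[1:] + ("*",)
            profile[ancestors + siblings] += 1

    visit(root, ("*",) * p)
    return profile

def pq_distance(t1, t2, p=2, q=3):
    """Normalised pq-gram distance in [0, 1]; 0 means identical profiles."""
    a, b = pq_profile(t1, p, q), pq_profile(t2, p, q)
    shared = sum((a & b).values())            # bag intersection size
    total = sum(a.values()) + sum(b.values()) # bag union size
    return 1 - 2 * shared / total

t1 = Node("+", [Node("x"), Node("y")])
t2 = Node("+", [Node("x"), Node("z")])
print(pq_distance(t1, t1), pq_distance(t1, t2))
```

Building a profile is linear in the size of the tree, which is where the speed advantage over exact tree edit distance comes from.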

~~~
jamii
> ... if the variance is low ...

Yes, that's a better way of putting it. The variance is so low that the tree
always ended up very deep. I put a lot of effort into tuning bucket sizes etc.
but in the end it just didn't work out.

> ... you could represent your formulae as trees instead of strings ...

For some definitions of edit distance, the edit distance between two trees is
equivalent to the Levenshtein distance between the in-order traversals of the
trees. In any case, the tree structure of LaTeX itself tends to be quite flat
and not very useful, and the tree structure of the equations themselves is
hard to derive in general. ArXMLiv is working on this, among others, but it's
still a hard research problem. If I could get my hands on the tree structure I
would be doing more sophisticated searches than edit distance, cf. hoogle and
uniquation.

------
jamii
I would love to give proper attribution for this algorithm, but I can't
remember for the life of me where I found it. This is the 4th major redesign
of the core search algorithm and the first that I've been happy with.

