
Ask HN: Classify relatedness between a query and a document - Skinish
Hi there!
I am looking to build a classifier for the field of Information Retrieval!
The purpose of the classifier is to identify whether or not a query is related to a document. The document here can be viewed as an article, aka 'big huge text'.

I am asking for ideas in several areas:

1. Do you know of any previous work worth mentioning on identifying the relatedness of a claim?
2. Do you know of any datasets that I could use to train my model? Maybe a similar Kaggle competition?

Thanks in advance!
======
brudgers
The industry standard tool is probably Apache Lucene [1]. It uses a
well-documented mechanism and is open source, so it would make a good place to
start research.

Good luck.

[1]:
[https://en.wikipedia.org/wiki/Apache_Lucene](https://en.wikipedia.org/wiki/Apache_Lucene)

~~~
Skinish
Thanks, but Apache Lucene is a search engine. I want to build a relatedness
classifier.

~~~
PaulHoule
The mainstream of IR research has deliberately avoided the development of a
"relatedness classifier" in the sense of a system that produces a probability
score that a document is relevant. The TREC organizers have been mostly
interested in "get good recall" and less interested in P@1 or rewarding the
kid who sticks his hand up really high when it's an easy question and knows
the answer.

You can tune up a conventional IR function with this kind of method

[http://scikit-learn.org/stable/modules/calibration.html](http://scikit-learn.org/stable/modules/calibration.html)

with the data that TREC publishes. The results are a little disappointing
because the system will probably never predict more than a 70% probability of
relevance.
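
To make the calibration idea concrete, here is a minimal dependency-free sketch of sigmoid (Platt-style) calibration, one of the methods the scikit-learn page covers. It maps a raw retrieval score to a probability of relevance. The scores and labels are made-up toy values, not TREC data, and the names `fit_platt` and `predict_proba` are hypothetical helpers for illustration only.

```python
import math


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p(relevant | s) = sigmoid(a * s + b) by gradient descent on log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            grad_a += (p - y) * s  # derivative of log loss w.r.t. a
            grad_b += (p - y)      # derivative of log loss w.r.t. b
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def predict_proba(score, a, b):
    """Calibrated probability that a document with this raw score is relevant."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Assumption: toy (raw IR score, judged relevant?) pairs standing in for
# scores from a ranking function paired with human relevance judgments.
scores = [0.5, 1.2, 2.0, 2.8, 3.5, 4.1, 5.0, 6.3]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

a, b = fit_platt(scores, labels)
for s in (1.0, 3.0, 6.0):
    print(f"score={s:.1f} -> P(relevant) ~ {predict_proba(s, a, b):.2f}")
```

In practice you would fit this on held-out judgments rather than the data used to tune the ranking function, and a library implementation would be the more robust choice.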

I am interested in the promise of "Siamese networks" for learning relatedness,
see

[https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf](https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf)

for example. For document-document matching I think this approach would be
great (and supportive of relevance feedback, something that mainstream IR
algorithms have never done that well at). Query-document matching may require
something similar but different, since there is an asymmetry between queries
(short) and documents (long).
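
As a rough structural sketch only: the defining feature of a Siamese setup is that both inputs pass through the same encoder with shared weights, and relatedness is read off the distance between the two encodings. The code below shows that structure with a hashed bag-of-words input and a random, untrained projection. Everything in it (`featurize`, `encode`, `relatedness`, the dimensions) is an assumption for illustration, not the method from the linked paper, and a real system would learn `W` from labeled related/unrelated pairs.

```python
import math
import random
import zlib

DIM_IN, DIM_OUT = 64, 8
rng = random.Random(0)

# One shared weight matrix: both "twins" of the network use it.
# Assumption: random and untrained here; training would adjust W so that
# related pairs land close together and unrelated pairs far apart.
W = [[rng.gauss(0.0, 1.0) for _ in range(DIM_IN)] for _ in range(DIM_OUT)]


def featurize(text):
    """Hashed bag-of-words vector of fixed length DIM_IN."""
    vec = [0.0] * DIM_IN
    for tok in text.lower().split():
        vec[zlib.crc32(tok.encode("utf-8")) % DIM_IN] += 1.0
    return vec


def encode(text):
    """Shared encoder: one linear layer followed by tanh."""
    x = featurize(text)
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relatedness(query, document):
    """Both inputs go through the same encoder, then get compared."""
    return cosine(encode(query), encode(document))


print(relatedness("apache lucene search", "lucene is an open source search library"))
```

Because the same encoder handles both sides, this layout is symmetric by construction, which is why the short-query versus long-document asymmetry may call for a variation on it.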

Click on my HN profile link and we can talk more. Thanks!

~~~
Skinish
Thank you for such a detailed answer! I'll get in touch with you tomorrow
then, for sure :)

