
What is TF-IDF? - helium
http://michaelerasm.us/tf-idf-in-10-minutes/
======
gibrown
Lucene is moving away from TF-IDF to BM25 as the default. Pretty similar idea,
but it tends to perform better with short content.

[https://issues.apache.org/jira/browse/LUCENE-6789](https://issues.apache.org/jira/browse/LUCENE-6789)

[https://en.wikipedia.org/wiki/Okapi_BM25](https://en.wikipedia.org/wiki/Okapi_BM25)

In the very limited test cases where I've compared them it hasn't mattered
much, but others' results are pretty compelling.

[https://www.elastic.co/blog/found-bm-vs-lucene-default-simil...](https://www.elastic.co/blog/found-bm-vs-lucene-default-similarity)
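To make the comparison concrete, here's a minimal sketch of BM25 scoring in Python (not Lucene's exact implementation; the corpus, parameter defaults k1=1.2 and b=0.75, and helper name are illustrative):

```python
import math

def bm25_score(term, doc, corpus, k1=1.2, b=0.75):
    """Score one term against one document, BM25-style."""
    N = len(corpus)
    df = sum(1 for d in corpus if term in d)         # document frequency
    idf = math.log(1 + (N - df + 0.5) / (df + 0.5))  # BM25's smoothed IDF
    tf = doc.count(term)                             # raw term frequency
    avgdl = sum(len(d) for d in corpus) / N          # average doc length
    # Saturating TF: extra occurrences help less and less, and the b term
    # normalizes for document length -- the part that helps short content.
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))

corpus = [
    "the cat sat on the mat".split(),
    "the dog".split(),
    "cat videos".split(),
]
short, long_ = corpus[1], corpus[0]
print(bm25_score("dog", short, corpus))   # positive: term occurs in a short doc
print(bm25_score("dog", long_, corpus))   # 0.0: term absent, tf = 0
```

The length-normalization term in the denominator is the main practical difference from plain TF-IDF: a match in a two-word document counts for more than the same match buried in a long one.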

~~~
meeper16
Vector space models, replacing or combined with TF-IDF approaches, are a new
way of summarizing and searching for meaning in documents...

[http://52.11.1.7/TuataraSum/example_context_control-ml2.html](http://52.11.1.7/TuataraSum/example_context_control-ml2.html)

~~~
gibrown
Interesting. This basically uses the background word2vec data for the entire
Web to provide more information and help with things like disambiguation,
synonyms, etc? Am I understanding that correctly?

Maybe a nit-picky thought, but it's not clear to me that the TF-IDF part is
what's doing a lot of the extra lifting there.

Do you know of any good evaluations between using vector space data and other
methods for summarization?

~~~
meeper16
Word2Vec was a fork of, or based on, a more exhaustive vector space approach:
[https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/1234...](https://www.kaggle.com/c/word2vec-nlp-tutorial/forums/t/12349/word2vec-is-based-on-an-approach-from-lawrence-berkeley-national-lab)

I've compared the summarization to others like OTS
([http://libots.sourceforge.net/](http://libots.sourceforge.net/)), which I
believe relies strictly on TF-IDF, and this approach seems better and allows
context to control the summarization.

Other similar approaches might be based on Latent Semantic Analysis, Latent
Semantic Indexing or LDA.

~~~
gibrown
Thanks for the links!

------
rohwer
Translate IDF to "how uncommon is this word in the corpus?"

TF-IDF is acronym soup, but mathematically simple: IDF is a scalar applied to
a term's frequency. And in the cosine-similarity comparison, the numerator is
the document overlap score (the dot product) and the denominator is the
product of the two documents' vector norms. For more, Stanford's natural
language processing course is the bee's knees:
[https://class.coursera.org/nlp/lecture/preview](https://class.coursera.org/nlp/lecture/preview)
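A quick sketch of that recipe in Python (the three toy documents and helper names are made up for illustration): TF-IDF weights per document, then cosine similarity as dot product over the product of the two vector norms.

```python
import math
from collections import Counter

docs = [
    "the quick brown fox".split(),
    "the lazy brown dog".split(),
    "lorem ipsum dolor".split(),
]

def tfidf(doc, corpus):
    """Map each term to tf * idf, where idf = log(N / document frequency)."""
    tf = Counter(doc)
    n = len(corpus)
    return {
        t: count * math.log(n / sum(1 for d in corpus if t in d))
        for t, count in tf.items()
    }

def cosine(a, b):
    """Dot product over the product of the two vectors' norms."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = [tfidf(d, docs) for d in docs]
print(cosine(vecs[0], vecs[1]))  # positive: docs share "brown"
print(cosine(vecs[0], vecs[2]))  # 0.0: no terms in common
```

Note that "the", appearing in two of the three documents, gets a low IDF weight, so the shared rare word "brown" drives most of the similarity.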

------
nathell
TF-IDF solves an important problem and it's good to know about.

However, in some applications, such as Latent Semantic Analysis (LSA) and its
generalizations, there are practical alternatives such as log-entropy [1] that
I've found to work better in practice.

[1]:
[http://link.springer.com/article/10.3758%2FBF03203370#page-1](http://link.springer.com/article/10.3758%2FBF03203370#page-1)

------
rhema
Here's an interesting demo I made where you can type or paste in words to get
a sense of their IDF
([http://tpoem.com/test/dict/test_dictionary.html](http://tpoem.com/test/dict/test_dictionary.html)).

------
meeper16
It's also used in AI-based document summarization systems that are worth
millions, e.g.

Yahoo Paid $30 Million in Cash for 18 Months of Young Summly
[http://allthingsd.com/20130325/yahoo-paid-30-million-in-cash...](http://allthingsd.com/20130325/yahoo-paid-30-million-in-cash-for-18-months-of-young-summly-entrepreneurs-time/)

Google Buys Wavii For North Of $30 Million
[http://techcrunch.com/2013/04/23/google-buys-wavii-for-north...](http://techcrunch.com/2013/04/23/google-buys-wavii-for-north-of-30-million/)

~~~
yannyu
Also used extensively in Lucene-based search engines such as Solr and
Elasticsearch. Most companies running search use a Lucene-based engine.

[https://lucene.apache.org/](https://lucene.apache.org/)

[http://lucene.apache.org/solr/](http://lucene.apache.org/solr/)

[https://www.elastic.co/](https://www.elastic.co/)

------
wyldfire
Is this similar to the concept used by Amazon's "statistically improbable
phrases" (word-based instead of n-gram based)?

EDIT: according to SO, yes:
[http://stackoverflow.com/a/2009546/489590](http://stackoverflow.com/a/2009546/489590)

~~~
dangerlibrary
Yes! Although ...

"Wait a minute. Strike that. Reverse it. Thank you."

TF-IDF is old, and very cool. N-gram-based extensions of it are a bit newer,
but are likely implemented in almost exactly the same way. N-grams just
require a lot more compute power because your vocabulary grows much faster
than with a plain ol' bag of words.
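The vocabulary blow-up is easy to see on a toy sentence (this snippet and its helper are just an illustration, not anyone's actual pipeline):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-token windows of the input."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

text = "the quick brown fox jumps over the lazy dog the quick fox".split()
unigrams = Counter(ngrams(text, 1))
bigrams = Counter(ngrams(text, 2))
# Even here the bigram vocabulary is already larger than the unigram one,
# and the gap widens rapidly on real corpora.
print(len(unigrams), len(bigrams))
```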

------
languagehacker
Nice job explaining a fundamental IR algorithm.

