

Ask HN: Where can I find tf–idf for English? - paraschopra

I want to calculate weights for the terms appearing in webpages. For that I
need idf (inverse document frequency) values: multiplying a term's frequency
in a webpage by its idf gives tf–idf, a measure of how important the term is
in characterizing the page.

The problem is that for a fixed set of documents you can compute idf by
counting document frequencies across the whole set, but this is not possible
with webpages, since the Internet has a practically unbounded number of
English pages. To work around this, I am considering two approaches:

1. Scraping the number of results Google returns for a term and taking that
as a proxy for document frequency.

2. Using Wikipedia as a proxy for the whole Internet.

The problem with the first is that it is not scalable and it is against
Google's TOS. The second is more tractable, but the Wikipedia dump
(http://static.wikipedia.org/) is about 14G zipped (it includes images, which
I don't need), which I suspect is too large to process.

Does anyone know of a processed list of this form? Any English corpus with
term frequencies? If not, then rather than processing all of Wikipedia, a
better approach would probably be to crawl a (random?) subset of Wikipedia
pages and process those. Any suggestions or tips?
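
Concretely, the computation I have in mind is something like this (a minimal
Python sketch; "docs" is whatever stand-in collection I end up using, and the
log formula is just one common idf variant):

    import math
    from collections import Counter

    def idf_table(docs):
        """docs: a list of token lists, one per stand-in document."""
        df = Counter()
        for doc in docs:
            df.update(set(doc))      # each term counts once per document
        n = len(docs)
        return {term: math.log(n / count) for term, count in df.items()}

    def page_weights(page_tokens, idf):
        """Weight each term of one webpage by tf * idf."""
        tf = Counter(page_tokens)
        # terms unseen in the stand-in corpus get weight 0 (a judgment call)
        return {t: f * idf.get(t, 0.0) for t, f in tf.items()}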
======
gtani
The Reuters corpus was used for a lot of papers, as was TREC.

[http://lucene.grantingersoll.com/2008/05/18/open-source-sear...](http://lucene.grantingersoll.com/2008/05/18/open-source-search-engine-relevance/)

[http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_r...](http://jmlr.csail.mit.edu/papers/volume5/lewis04a/lyrl2004_rcv1v2_README.htm)

------
yannis
Your best bet is to download the Brown corpus. It is a bit dated but well
documented. However, neither the Brown corpus nor Wikipedia is a
representative sample of the web. If you let us know a bit more about what
you are trying to do, we may be able to suggest better solutions.

~~~
paraschopra
I am trying to extract a set of keywords (and their weights) for any given
webpage. Another option is Google's n-gram corpus, which has some 13 million
unigrams, but there is no way to download it (you have to order it for some
700 USD, I guess).

The Brown Corpus doesn't give term frequencies, does it?

~~~
yannis
The advantage of using well known corpora is that they are also 'tagged' so
later on you can extend your text analysis. Text analysis is addictive, the
more you do the more you want. Another option is the British National Corpus
at probably around twenty pounds for the cd, with over a million words. The
latter is also available in XML.
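
(As far as I know, the Brown corpus itself is just tagged text, so you derive
the frequencies yourself. A minimal sketch with NLTK, assuming it and its
Brown data are installed:)

    import nltk
    from nltk.corpus import brown

    nltk.download('brown')            # one-time fetch of the corpus data
    freqs = nltk.FreqDist(w.lower() for w in brown.words())
    print(freqs.most_common(10))      # the most frequent terms
    print(freqs['government'])        # raw count for a single term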

I was in a similar situation about two years ago, when I did some research on
blogs. As an aside, the word 'blogs' does not appear even once in the British
National Corpus! That put an end to my comparing against known corpora.

Quick and dirty solution: write your own routines. At the time I used PHP!
Python is a better option, but my Python skills were limited. I spidered
blogs (with cURL); if the word 'blogger', 'blogroll' or 'comments' appeared,
I spidered further.
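
(In Python the same heuristic looks roughly like this; a sketch using only
the standard library, not the PHP I actually ran, with politeness and
robots.txt handling omitted:)

    import re
    import urllib.request
    from collections import deque

    MARKERS = ('blogger', 'blogroll', 'comments')

    def spider(seed, limit=20000):
        """Breadth-first crawl that only follows blog-like pages."""
        queue, seen, pages = deque([seed]), {seed}, []
        while queue and len(pages) < limit:
            url = queue.popleft()
            try:
                raw = urllib.request.urlopen(url, timeout=10).read()
            except Exception:
                continue
            html = raw.decode('utf-8', 'replace')
            if not any(m in html.lower() for m in MARKERS):
                continue                  # not blog-like: stop here
            pages.append((url, html))
            for link in re.findall(r'href="(http[^"]+)"', html):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        return pages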

Once I had 20,000 pages I started the calculations. As you add to the corpus
your statistics change, so you need an efficient method of keeping them up to
date. With PHP I was able to download on average one page every 20 seconds.
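
(One way to keep the statistics cheap to refresh is to maintain running
document-frequency counts and recompute idf only on demand; a sketch, not the
code I actually used:)

    import math
    from collections import Counter

    class IncrementalIdf:
        """Document-frequency counts that absorb one page at a time."""
        def __init__(self):
            self.df = Counter()
            self.n_docs = 0

        def add(self, tokens):
            self.n_docs += 1
            self.df.update(set(tokens))   # each term counts once per page

        def idf(self, term):
            # smoothed so unseen terms don't divide by zero
            return math.log((1 + self.n_docs) / (1 + self.df[term]))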

Your stop words also need to evolve as you go: things like 'RSS' will crop up
all the time, and you would think 'wordpress' is one of the most common words
in English! This led me to the next stage, polishing up some form of
pre-processor for cleaning up HTML. (Most classes out there are not adequate;
they tend to leave items in, such as CSS styles.) I used tidy, believe it or
not, very effectively for quick pre-processing. (It helped a lot to just
isolate the <body>.)
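
(If you would rather stay in Python, the standard-library parser can do the
same quick clean-up, keeping only the text inside <body> and dropping
<script> and <style>; a rough stand-in for tidy:)

    from html.parser import HTMLParser

    class BodyText(HTMLParser):
        """Collect the text inside <body>, skipping <script>/<style>."""
        def __init__(self):
            super().__init__()
            self.in_body = False
            self.skip = 0
            self.chunks = []

        def handle_starttag(self, tag, attrs):
            if tag == 'body':
                self.in_body = True
            elif tag in ('script', 'style'):
                self.skip += 1

        def handle_endtag(self, tag):
            if tag == 'body':
                self.in_body = False
            elif tag in ('script', 'style') and self.skip:
                self.skip -= 1

        def handle_data(self, data):
            if self.in_body and not self.skip:
                self.chunks.append(data)

    def body_text(html):
        parser = BodyText()
        parser.feed(html)
        return ' '.join(' '.join(parser.chunks).split())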

Other adventures along the way included downloading books from Project
Gutenberg and using them as corpora.

------
sharpn
I suggest you use a newspaper's archive if Wikipedia doesn't suit you.

