

What tools do you recommend for text mining? - andresmh

I have a text file with online discussions from a website (about 54 million words) and I would like to do some analysis on it. I have done some basic word frequency counts, but I am interested in doing things like clustering to find which words appear together most often. Something like this: http://jcmc.indiana.edu/vol8/issue4/rosen.html#sixth

I'm looking for simple, free tools that let me do some basic analysis and get a basic understanding of the content of the text. I'm familiar with Perl and Python primarily.

Thanks.
======
drallison
You have not specified clearly what result you want.

If you want to apply existing tools to solve particular known problems, you
might want to look at Toby Segaran's Programming Collective Intelligence
(2007) for a survey of the sort of things people have done. Or ask his list
for what kinds of things you want to learn from this data.

If you want to discover new relationships in the data, there
are tools for that as well. See, for example,
[http://people.ischool.berkeley.edu/~hearst/papers/acl99/acl9...](http://people.ischool.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html).
Systems that create knowledge from data by some independent process are still
rare and sketchy.

------
DrJosiah
Gensim has LSI (latent semantic indexing, built on SVD) for this kind of analysis, and is in Python.
<http://nlp.fi.muni.cz/projekty/gensim/>

There is also SVMlight, which can do many of the same things with potentially
less work from you. I've not used it, so I don't know how well it works.
<http://svmlight.joachims.org/>

------
drallison
A useful survey of algorithms:

"Top 10 algorithms in data mining", Knowledge and Information Systems,
Volume 14, Issue 1 (December 2007), pages 1-37. ISSN 0219-1377.

This paper presents the top 10 data mining algorithms identified by the IEEE
International Conference on Data Mining (ICDM) in December 2006: C4.5,
k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.
These top 10 algorithms are among the most influential data mining algorithms
in the research community. With each algorithm, we provide a description of
the algorithm, discuss the impact of the algorithm, and review current and
further research on the algorithm. These 10 algorithms cover classification,
clustering, statistical learning, association analysis, and link mining, which
are all among the most important topics in data mining research and
development.
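
Of the algorithms listed, k-Means is probably the closest fit to the clustering the question asks about. Here is a minimal sketch of it using only the Python standard library. The function names, the fixed seed, and the iteration cap are assumptions made for illustration; they are not taken from the paper. For text, each point would be something like a word-frequency vector for one document.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: alternate assignment and centroid updates."""
    rng = random.Random(seed)
    # Start from k distinct input points chosen at random.
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(mean(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # centroids stopped moving
        centroids = new_centroids
    return centroids, assign(points, centroids)
```

Calling `kmeans(vectors, k=2)` returns the centroids and one cluster label per input vector. The result depends on the random starting points, so for real data it is worth trying several seeds.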

------
brisance
A lot of resources are in Java.

<http://www.cs.waikato.ac.nz/ml/weka/> <http://rapid-i.com/>

------
Absolute0
NLTK is the best resource if you know Python. Nothing beats it!

~~~
andresmh
This looks great. It even has a textbook!

~~~
apurva
+1 for NLTK... it's very, very neat!

------
edwtjo
I only know of one, in Java, <http://www.ontopia.net>.

~~~
andresmh
Looks interesting. What have you done with it?

------
aitoehigie
Beautiful Soup

~~~
andresmh
Isn't Beautiful Soup primarily for screen scraping? I already scraped the
content and stored it in a DB. Now I want to do some analysis on the text.

~~~
irahul
"Programming Collective Intelligence" is a good book if you are willing to
invest time in it. If your goal is just to get this task done, NLTK would work
for you.
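
To give a rough idea of what "words that appear together" looks like in code, here is a minimal sketch that counts word pairs within a small sliding window. It deliberately uses only the standard library, so it is not NLTK's API (NLTK has its own collocation tools). The tokenizer regex, the window size, and the stopword list are all assumptions you would want to tune.

```python
import re
from collections import Counter

# Very simple tokenizer: runs of lowercase letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(line):
    """Lowercase a line and split it into simple word tokens."""
    return TOKEN_RE.findall(line.lower())


def cooccurrence_counts(lines, window=5, stopwords=frozenset()):
    """Count unordered word pairs occurring within `window` tokens on a line."""
    pairs = Counter()
    for line in lines:
        tokens = [t for t in tokenize(line) if t not in stopwords]
        for i, word in enumerate(tokens):
            # Pair the word with the next (window - 1) tokens.
            for other in tokens[i + 1 : i + window]:
                if other != word:
                    pairs[tuple(sorted((word, other)))] += 1
    return pairs


if __name__ == "__main__":
    sample = [
        "the cat sat on the mat",
        "the cat chased the mouse",
        "a dog chased the cat",
    ]
    counts = cooccurrence_counts(sample, window=3, stopwords={"the", "a", "on"})
    for pair, n in counts.most_common(3):
        print(pair, n)
```

Since an open file is an iterable of lines, you can pass it straight in (for example `cooccurrence_counts(open("discussions.txt"))`, where the file name is hypothetical) and the text is read line by line. With 54 million words the pair counter itself can get large, so a stopword list and a small window help keep it manageable.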

