

Ask HN: What are some good open-source language/textual analysis tools? - CoreSet

I'm looking to do linguistic/textual analysis on a large amount of text I've
scraped for a research project, finding stats like: frequently used words,
associated topic clusters, gender estimations.

I wrote the scraper myself, but the language analysis is something it seems
it'd be easier to find OS and use out of the box or slightly modified.

Anyone have any ideas/leads? Preference is for a script or process I can run
from the CL to output the vitals.
======
whitej125
If you are a Python person (a very popular language in the data science realm
these days), your gateway drug to linguistic and textual analysis is going to
be NLTK.

[http://www.nltk.org/](http://www.nltk.org/)

The free book and tutorials are great and you can get up and running pretty
quickly.
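
For the "frequently used words" part of the question, a minimal sketch with
NLTK might look like the following. The `top_words` helper, the sample text,
and the tiny hand-written stopword list are all made up for illustration;
NLTK also ships a fuller stopwords corpus, but that needs a separate download
step, so this sketch avoids it.

```python
from nltk import FreqDist
from nltk.tokenize import wordpunct_tokenize

# Tiny hand-picked stopword list (an assumption for this sketch); NLTK's own
# stopwords corpus would need a separate nltk.download("stopwords") step.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "on"}


def top_words(text, n=10):
    """Return the n most frequent alphabetic, non-stopword tokens in text."""
    # wordpunct_tokenize is regex-based, so it needs no downloaded models.
    tokens = [t.lower() for t in wordpunct_tokenize(text)]
    words = [t for t in tokens if t.isalpha() and t not in STOPWORDS]
    return FreqDist(words).most_common(n)


sample = "The cat sat on the mat. The cat saw a dog, and the dog saw the cat."
for word, count in top_words(sample, n=3):
    print(word, count)
```

Swapping `sample` for the contents of a scraped file would probably be the
first step toward a script you can run from the command line.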

NLTK's gentler learning curve is great for getting your head around NLP
concepts. Once you start looking for more functionality or performance,
you'll find yourself graduating to scikit-learn
([http://scikit-learn.org/stable/](http://scikit-learn.org/stable/)).
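
As a rough idea of what the scikit-learn step could look like for the "topic
clusters" part of the question, here is a small sketch that groups documents
using TF-IDF features and k-means. The four toy documents and the choice of
two clusters are assumptions for illustration only; real scraped text would
need more thought about preprocessing and the number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for this sketch): two documents about animals,
# two about markets.
docs = [
    "the cat chased the mouse",
    "a cat and a mouse played",
    "stocks and bonds fell today",
    "bonds rallied while stocks fell",
]

# Turn each document into a TF-IDF weighted bag-of-words vector.
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Group the documents into two clusters; n_clusters=2 is an assumption.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for doc, label in zip(docs, labels):
    print(label, doc)
```

k-means is just one option here; scikit-learn also has topic-model style
decompositions that may suit larger corpora better.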

In the Java world, I think Mahout is/was popular. It takes quite a bit more
setup to get up and running.

~~~
CoreSet
I'm actually a dyed-in-the-wool Node.js acolyte (which I used for the
scraper), but I've played around with Python, and I'm sure it would be worth
learning as a second language (I don't count my very-out-of-date Ruby).

Thanks for the tip!

------
manidoraisamy
Stanford NLP is pretty good, if you are on Java -
[http://nlp.stanford.edu/software/corenlp.shtml](http://nlp.stanford.edu/software/corenlp.shtml)

You might also want to look at word2vec (implemented in most of the popular
languages) -
[https://code.google.com/p/word2vec/](https://code.google.com/p/word2vec/)

------
wallflower
This seems to have some good starting pointers:

[http://blog.datadive.net/which-topics-get-the-upvote-on-
hack...](http://blog.datadive.net/which-topics-get-the-upvote-on-hacker-news/)

~~~
CoreSet
Thanks! Yeah, this helps confirm the utility of picking up some Python.

------
biomimic
This text summarizer will be open sourced soon:
[http://genopharmix.com/TuataraSum](http://genopharmix.com/TuataraSum)

