Yay #1: a nice wrapper around NLTK. NLTK is great, but its API is not very Pythonic or comfortable. Pleasant facades over it (see the quick sketch below) are a great help for Python NLP.
Yay #2: an actually interesting programming-related article on HN. These get rarer every day, losing ground to gossip about what Snowden said in response to some NSA official's remarks about Snowden's even earlier remarks.
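For anyone who hasn't tried it, here's a minimal sketch of the kind of facade I mean, using TextBlob's documented basics (the import path matches the current `text` package, as seen in the traceback elsewhere in this thread):

>>> from text.blob import TextBlob
>>> blob = TextBlob("TextBlob wraps NLTK in a friendlier API.")
>>> blob.tags          # part-of-speech tags as (word, tag) pairs
>>> blob.noun_phrases  # noun phrase extraction
>>> blob.sentiment     # (polarity, subjectivity) tuple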
On the one hand, I agree with your yay #2 (and your yay #1, of course; TextBlob looks great). I think you're right. I have many venues to discuss NSA issues and few venues to discuss startup/programming stuff. I like having a venue that is typically devoted to such stuff.
On the other hand, I'm not sure that saying "this isn't about Snowden" on tech-related articles that don't involve Snowden is the solution. Why bring him into the conversation when we're talking about Python?
> Why bring him into the conversation when we're talking about Python?
Just as words of encouragement for more programming-related content. I'm letting off steam, really. In the past few weeks I've been spending more time on /r/programming than on HN, something I could not have imagined a year ago.
I've been doing the exact same thing, and I empathize with you. I'm still just wary of comments like this because I could imagine them taking over otherwise useful tech discussions.
Honestly it's frustrating to me that a meta-discussion like this is necessary here. I think you're right, I just wish I didn't have to say it.
I would be visiting Lobsters if it weren't invite-only; I'm not convinced that a system like that guarantees quality content any better than something more elegant like weighted votes.
TextBlob is probably just using Pattern's en module. I would suggest everyone take a look at the other modules, in particular the web module should you be doing any light data scraping; it has nice wrappers around BeautifulSoup and Scrapy, among others, and jumping into those directly can be daunting for beginners.
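To give a taste, a minimal pattern.web sketch using its documented URL/plaintext helpers (the target URL here is just an arbitrary example):

>>> from pattern.web import URL, plaintext
>>> html = URL('http://news.ycombinator.com/').download()  # fetch the raw page
>>> text = plaintext(html)                                 # strip markup down to readable text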
I've had good fun playing around with this, it's certainly made NLP more approachable.
One issue, though: it seems to choke on certain characters.
For instance, given the character £ it complains with this error message:
>>> TextBlob("£")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/eterm/nlp/local/lib/python2.7/site-packages/text/blob.py", line 340, in __repr__
return unicode("{cls}('{text}')".format(cls=class_name, text=self.raw))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 10: ordinal not in range(128)
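For what it's worth, the traceback suggests the blob is holding a byte string; passing a unicode literal on Python 2 should sidestep the repr failure (an untested guess based on the traceback):

>>> TextBlob(u"£")   # unicode in, so __repr__ no longer has to decode bytes as ASCII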
Yeah, ditto. I created a new virtualenv with Python 3 and those problems disappeared. Before that I hacked around a bit and did the `from __future__ import unicode_literals` bit, which alleviated the issue (but then ipython had problems with `repr(blob)`). I finally just gave up and ran `mkvirtualenv textblob --python=python3` (on Ubuntu 12.10).
Ah, I'm still using Python 2, which might be causing my problems. For now I'll just try to work around it by hacking it out of my source data.
(My source data is my own HN comments; it's funny doing sentiment analysis on them, seeing how objective or subjective it thinks my posts are, as well as whether I'm generally cheery or miserable.)
(My end game is to produce an HN reader which only shows positive comments and news to reduce the amount of reading I do. ;))
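(Roughly what I'm doing, with stand-in data; `sentiment` is TextBlob's (polarity, subjectivity) pair:)

>>> from text.blob import TextBlob
>>> comments = ["stand-in comment one", "stand-in comment two"]       # my scraped HN comments would go here
>>> ranked = sorted(comments, key=lambda c: TextBlob(c).sentiment[1]) # index 1 = subjectivity
>>> ranked[-1]   # the comment TextBlob scores as most subjective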
It gets it mostly right, except for the occasional hiccough. One of these is the following passage, which stood out as my most subjective post (1.0 on subjectivity!):
""
Factorisation is unique, the addition of 3 primes is not.<p>e.g. 29 can be written 5 + 11 + 13 or 3 + 3 + 23<p>So even if it were a difficult operation to reverse addition of 3 numbers, it would be made easier by collisions.
""
I'm left stumped as to why nltk thinks this is not only subjective but a 1.0 completely subjective post!
Maybe you're just confusing the facts being stated ("even if it were a difficult operation to reverse addition of 3 numbers" implies it is easy, which is true as it's just an oblique restatement of the commutative property) with the language being used. "Difficult", "easier", "even if", etc.
I have no idea how sentiment analysis works though.
The NodeBox Linguistics module is another nice wrapper around NLTK (and other natural language processing libraries). I used it for extracting actions and details from sentences, but it's also great for spelling correction, pluralization, part-of-speech tagging, and other common NLP tasks.
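For comparison, TextBlob covers some of the same ground; a minimal sketch using TextBlob's calls rather than NodeBox's (assuming the newer `textblob` package name and its documented Word/correct helpers):

>>> from textblob import TextBlob, Word
>>> Word("word").pluralize()                         # pluralization: "words"
>>> TextBlob("I havv goood speling").correct()       # spelling correction
>>> TextBlob("Extract actions from sentences").tags  # part-of-speech tagging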
I work on NLP with Python, both for my studies and my side job.
Sorry, but I think this thing is very much overrated by the HN crowd. There are many such libraries, and this one adds exactly nothing. I also don't see how this is easier to use than, let's say, Pattern.
Try adding new functionality. One idea would be to use an ontology to calculate the distance between two words; then you can do other cool things with that and put it in your module.
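For instance, NLTK's WordNet interface already exposes a path-based distance; a minimal sketch (naively taking the first synset of each word, and assuming the WordNet corpus is downloaded):

>>> from nltk.corpus import wordnet as wn
>>> dog, cat = wn.synsets('dog')[0], wn.synsets('cat')[0]  # crude: assume the first sense is the right one
>>> dog.path_similarity(cat)   # similarity in [0, 1] based on hypernym-path distance
0.2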
If you could add a `blob.target` and a default vectorizer, you could use scikit-learn to offer some nice classification and regression. It's pretty easy to do that with what you have now, but some of those concepts are a little foreign if you haven't done text classification before (like me before yesterday). Particularly the part-of-speech tagging: using those tags as features could be powerful alongside n-grams.
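A rough sketch of the scikit-learn side (toy data; `blob.target` and the default vectorizer are the hypothetical additions being proposed here):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.naive_bayes import MultinomialNB
>>> texts, labels = ["great library", "terrible docs"], [1, 0]    # toy corpus and targets
>>> X = CountVectorizer(ngram_range=(1, 2)).fit_transform(texts)  # unigram + bigram counts
>>> clf = MultinomialNB().fit(X, labels)                          # plugs into any sklearn estimator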
So after poking around with this for a bit, I will say that it DEFINITELY is vulnerable to Python 2's string-handling warts. Constructing a `TextBlob` out of a string with non-ASCII characters doesn't seem to work. I created another virtualenv with Python 3 and it works quite well.
> Can it also write natural language text based on data inputs?
From the features list, it doesn't seem to.
What you're referring to is text generated using a Markov chain algorithm. This will generate text that at first glance seems to be human-written; on closer inspection you'll find that it only follows common linguistic patterns, and the actual content is gibberish.
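To make that concrete, a minimal bigram Markov chain sketch (toy corpus, no smoothing or sentence handling):

import random
from collections import defaultdict

corpus = "the cat sat on the mat and the cat slept".split()  # toy training text

# Map each word to the list of words observed to follow it.
chain = defaultdict(list)
for a, b in zip(corpus, corpus[1:]):
    chain[a].append(b)

# Walk the chain: sample each next word from the observed successors.
word, out = "the", ["the"]
for _ in range(8):
    if not chain[word]:
        break  # dead end: this word was never seen with a successor
    word = random.choice(chain[word])
    out.append(word)

print(" ".join(out))  # locally plausible, globally gibberish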
This sounds interesting. Can you give an example use case (what would the input data be, and what would the generated natural language look like)? I'll see if I can do it.
It uses an NLTK-based Naive Bayes analyzer and the PatternAnalyzer from Pattern [1]. But I wouldn't want to be limited to just one or two algorithms. My recent work involved sentiment analysis of data from US politics, and I got significant differences in the results when I used different algorithms such as SVM.
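For reference, swapping analyzers looks like this under the newer `textblob` package layout (the old package was named `text`; treat the import paths as an assumption):

>>> from textblob import TextBlob
>>> from textblob.sentiments import NaiveBayesAnalyzer
>>> TextBlob("Great release!").sentiment                                 # default PatternAnalyzer
>>> TextBlob("Great release!", analyzer=NaiveBayesAnalyzer()).sentiment  # NLTK-backed, trains on movie reviews corpus on first use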