Show HN: TextBlob, Natural language processing made simple in Python (readthedocs.org)
303 points by sloria on Aug 10, 2013 | 46 comments

Yay #1: a nice wrapper around NLTK. NLTK is great but its API is not very Pythonic or comfortable. Pleasant facades over it are a great help for Python NLP.

Yay #2: an actually interesting programming-related article on HN. These get rarer every day, losing their place to gossip about what Snowden remarked following one or another NSA official's remarks about Snowden's even earlier remarks.

I'm conflicted on this comment.

On the one hand, I agree with your yay #2 (and your yay #1, of course; TextBlob looks great). I think you're right. I have many venues to discuss NSA issues and few venues to discuss startup/programming stuff. I like having a venue that is typically devoted to such stuff.

On the other hand, I'm not sure that saying "this isn't about Snowden" on tech-related articles that don't involve Snowden is the solution. Why bring him into the conversation when we're talking about Python?

> Why bring him into the conversation when we're talking about Python?

Just as words of encouragement for more programming-related content. I'm blowing off steam, really. In the past few weeks I've been spending more time on /r/programming than on HN, something I could not have imagined a year ago.

I've been doing the exact same thing, and I empathize with you. I'm still just wary of comments like this because I could imagine them taking over otherwise useful tech discussions.

Honestly it's frustrating to me that a meta-discussion like this is necessary here. I think you're right, I just wish I didn't have to say it.

>> ... discuss startup/programming stuff. I like having a venue that is typically devoted to such stuff.

Did you look at https://lobste.rs/ ?

I also feel programming stuff is dwindling on HN, though I can understand all the talk about Snowden on a forum like HN.

I would be visiting lobsters if it weren't invite-only. I am not convinced that a system like that guarantees quality content any better than something more elegant like weighted votes.

As far as I know lobsters is invite only. From what I understand it's a forum I'd enjoy though.

I'd love an invite if you happen to have one. Contact info in my profile.

Sent you an invite!

I agree with your sentiment, but unfortunately if you were interested in avoiding a discussion of Snowden you probably just should have stuck to #1.

I posit that a meta-discussion is different from a discussion in this case.

Just a quick word on Pattern[1].

TextBlob is probably just using the en module. I would suggest everyone take a look at the other modules, in particular the web module, should you be doing any light data scraping. It has nice wrappers around BeautifulSoup and Scrapy, among others; jumping straight into BeautifulSoup and Scrapy can be daunting for beginners.

[1] http://www.clips.ua.ac.be/pages/pattern

I've had good fun playing around with this, it's certainly made NLP more approachable.

One issue, though, is that it seems to choke on certain characters.

For instance, given the character £ it complains with this error message:

    >>> TextBlob("£")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/eterm/nlp/local/lib/python2.7/site-packages/text/blob.py", line 340, in __repr__
        return unicode("{cls}('{text}')".format(cls=class_name, text=self.raw))
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 10: ordinal not in range(128)

Yeah, ditto. I created a new virtualenv with Python 3 and those problems disappeared. Before that I hacked around a bit and tried `from __future__ import unicode_literals`, which alleviated the issue (but then ipython had problems with `repr(blob)`). I finally just gave up and ran `mkvirtualenv textblob --python=python3` (on Ubuntu 12.10).
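For anyone puzzled by the traceback above, here is a minimal Python 3 sketch of the same failure. It assumes the Python 2 `str` was holding UTF-8 bytes; `repr_bytes` is a made-up name, and this is not TextBlob's code, only the ASCII decode step that Python 2 performs implicitly when `str` and `unicode` are mixed.

```python
# Python 3 sketch of the Python 2 failure above (not TextBlob's actual code).
# Assumption: the Python 2 byte string held UTF-8 encoded text.
repr_bytes = "TextBlob('£')".encode("utf-8")  # '£' becomes the two bytes 0xC2 0xA3

try:
    # Python 2 did this decode implicitly when mixing str and unicode.
    repr_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)

# Decoding with the codec the bytes were actually written in works fine.
decoded = repr_bytes.decode("utf-8")
```

The offending byte sits right after the ten ASCII characters of `TextBlob('`, which is why the error points at position 10.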

Ah, I'm still using Python 2, which might be causing my problems. For now I'll just try to work around it by hacking it out of my source data.

(My source data is my own HN comments. It's funny doing sentiment analysis on them, seeing how objective or subjective it thinks my posts are, as well as whether I'm generally cheery or miserable.)

(My end game is to produce an HN reader which only shows positive comments and news to reduce the amount of reading I do. ;))

It gets it mostly right, except for the occasional hiccough. One of these is the following passage, which stood out as my most subjective post (1.0 on subjectivity!):

> Factorisation is unique, the addition of 3 primes is not.

> e.g. 29 can be written 5 + 11 + 13 or 3 + 3 + 23

> So even if it were a difficult operation to reverse addition of 3 numbers, it would be made easier by collisions.

I'm left stumped as to why NLTK thinks this is not only subjective but a 1.0, completely subjective post!
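As an aside, the collision claim in the quoted passage is easy to check. The sketch below enumerates every way to write a number as a sum of three primes; `is_prime` and `three_prime_sums` are names invented here for illustration.

```python
from itertools import combinations_with_replacement

def is_prime(n):
    """Trial division; fine for small n."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def three_prime_sums(target):
    """Return every non-decreasing triple of primes that sums to target."""
    primes = [p for p in range(2, target) if is_prime(p)]
    return [t for t in combinations_with_replacement(primes, 3) if sum(t) == target]

triples = three_prime_sums(29)
print(triples)
```

Both decompositions from the passage, (5, 11, 13) and (3, 3, 23), show up among the results, so the sum alone does not identify the three primes.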

Maybe you're just confusing the facts being stated ("even if it were a difficult operation to reverse addition of 3 numbers" implies it is easy, which is true as it's just an oblique restatement of the commutative property) with the language being used: "difficult", "easier", "even if", etc.

I have no idea how sentiment analysis works though.
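For what it's worth, one common approach is lexicon-based: look up a score for each known word and average them, with no understanding of the claim being made. The toy sketch below shows only that idea. The lexicon and its scores are entirely made up, and this is not the code or data that NLTK, Pattern, or TextBlob use.

```python
import re

# Hypothetical word -> subjectivity scores (0.0 objective, 1.0 subjective).
TOY_LEXICON = {"difficult": 1.0, "easier": 1.0, "unique": 1.0}

def toy_subjectivity(text):
    """Average the scores of the words found in the lexicon; 0.0 if none match."""
    words = re.findall(r"[a-z']+", text.lower())
    scores = [TOY_LEXICON[w] for w in words if w in TOY_LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(toy_subjectivity("Factorisation is unique, the addition of 3 primes is not."))
print(toy_subjectivity("29 can be written 5 + 11 + 13."))
```

Under a scheme like this, a factual sentence scores as fully subjective as soon as the only words the lexicon recognises are ones it rates as opinionated.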

The NodeBox linguistics module is another nice wrapper around NLTK (and other natural language processing libraries). I used it for extracting actions and details from sentences, but it's also great for spelling correction, pluralization, part-of-speech tagging and other common NLP tasks.


It has been deprecated and continues under the name Pattern: http://www.clips.ua.ac.be/pages/pattern

Both for my studies and my side job I work on NLP with Python.

Sorry, but I think this thing is very much overrated by the HN crowd. There are many such libraries and this one adds exactly nothing. I also don't see how this is easier to use than, let's say, Pattern.

Try to add new functionality. One example would be using an ontology to calculate the distance between two words. Then you can do other cool things with that and put them in your module.
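To make the ontology idea concrete, here is a minimal sketch that measures the distance between two words as the number of edges between them in a hypernym tree. The `PARENT` taxonomy is a made-up toy; a real implementation would sit on top of something like WordNet.

```python
# Hypothetical mini-taxonomy: each word maps to its parent (hypernym).
PARENT = {
    "dog": "mammal", "cat": "mammal", "mammal": "animal",
    "sparrow": "bird", "bird": "animal", "animal": None,
}

def ancestors(word):
    """Return the chain from word up to the root, word included."""
    chain = []
    while word is not None:
        chain.append(word)
        word = PARENT[word]
    return chain

def path_distance(a, b):
    """Count the edges between a and b via their lowest common ancestor."""
    up_a = ancestors(a)
    up_b = ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("no common ancestor")

print(path_distance("dog", "cat"))
print(path_distance("dog", "sparrow"))
```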

Higher levels of abstraction are less intimidating and easier to get started with.

And, sometimes, less of a hassle even when you are accustomed to the lower levels.

This looks great! NLTK is incredible but definitely can be a bit intimidating. Very cool to have a wrapper around it.

I'm curious to see exactly how it works and so I'll certainly check out the source when I have a bit more time. Thanks for posting this.

If you could add a blob.target and a default vectorizer, you could use scikit-learn to offer some nice classification and regression. It's pretty easy to do that with what you have now, but some of those concepts are a little foreign if you haven't done text classification before, like me before yesterday. Part-of-speech tags in particular could be powerful as features alongside n-grams.
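For readers who haven't done text classification, the sketch below shows what "n-grams and part-of-speech tags as features" amounts to. It is a plain-Python stand-in, not scikit-learn's vectorizer API, and the `tagged` pairs are hypothetical; a real tagger would produce them.

```python
from collections import Counter

def ngram_features(tokens, n_max=2):
    """Count all n-grams of length 1..n_max in a token list."""
    features = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features

# Hypothetical (word, tag) pairs; a real tagger would produce these.
tagged = [("TextBlob", "NNP"), ("is", "VBZ"), ("simple", "JJ")]
word_feats = ngram_features([w.lower() for w, _ in tagged])
tag_feats = ngram_features([t for _, t in tagged])

print(word_feats)
print(tag_feats)
```

A classifier would then be trained on these counts, with one such feature dictionary per document.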

For the Google Translate functionality, does this pass the request through an intermediary service or direct to the API?

So after poking around with this for a bit, I will say that it DEFINITELY is vulnerable to Python2's string handling warts. Constructing a `TextBlob` out of a string with non-ASCII characters doesn't seem to work. I created another virtualenv with Python 3 and it works quite well.

I played with this a few days ago. It is a nice wrapper for NLTK. You probably want to, at some point, read the free NLTK book online.

Edit: and it also uses pattern.

Can someone explain what this does in layman's terms? I'm a biz guy, not a coder, but I'm interested in the use cases. thanks

This is an extension to the Python programming language that makes it easier to analyze and manipulate text.

For example, an analyst might use sentiment analysis to see whether Facebook posts about a product are "positive" or "negative" in tone.

As another example, I hacked together this online sentiment analyzer using TextBlob: https://textfeel.herokuapp.com/

See also:

NLP (Wikipedia): https://en.wikipedia.org/wiki/Natural_language_processing

NLTK (a Python library for NLP): http://nltk.org/

Twitter opinion mining using Pattern: http://www.clips.ua.ac.be/pages/pattern-examples-elections

OK, so it's mostly to analyze text that's already been written? Can it also write natural language text based on data inputs?

> Can it also write natural language text based on data inputs?

From the features list, it doesn't seem to.

What you're referring to is text generated using a Markov chain algorithm. This will generate text that seems at first glance to be human-written. On closer inspection you'll find that it only follows common linguistic patterns; the actual content is gibberish.
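A minimal sketch of a word-level Markov chain generator, for the curious. The corpus is a made-up sentence and the function names are invented here; real generators train on far more text and often use longer contexts than a single word.

```python
import random
from collections import defaultdict

def build_chain(text):
    """Map each word to the list of words that follow it in the text."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain

def generate(chain, start, length, seed=0):
    """Walk the chain from start, picking a random successor at each step."""
    rng = random.Random(seed)
    out = [start]
    # Stop early if the last word never had a successor in the corpus.
    while len(out) < length and chain.get(out[-1]):
        out.append(rng.choice(chain[out[-1]]))
    return " ".join(out)

corpus = "the dow rose today and the dow fell today and the market closed"
chain = build_chain(corpus)
print(generate(chain, "the", 8))
```

Every adjacent word pair in the output occurred somewhere in the corpus, which is why the result looks locally plausible while meaning nothing overall.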

This sounds interesting. Can you give an example use case (what the input data would be and what the generated natural language would look like)? I will try to see if I can do it.

For example, financial data would be used as input to generate a daily stock market overview. Something along the lines of:

"Today, the Dow hit a high of 16,200, marking the first time it has crossed the 16,000 barrier. blah blah blah, etc"

Basically, use data points to create a market overview where readers wouldn't know that it was computer generated. That's one idea.
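The simplest version of this is template filling, sketched below. The function name, the template wording, and the sample numbers are all made up for illustration; convincing output would need many templates and much more variation.

```python
def market_summary(index_name, high, previous_record):
    """Fill a fixed sentence template from a few data points."""
    sentence = "Today, the {name} hit a high of {high:,}".format(
        name=index_name, high=high
    )
    # Only mention a record when the data supports it.
    if high > previous_record:
        sentence += ", a new record"
    return sentence + "."

# Hypothetical data points.
print(market_summary("Dow", 16200, 15900))
```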

I think the only use case is spam. But it's a big one.

Natural language generation is also an NLP task, but this particular library doesn't seem to tackle it at the moment.

Good hack, now I'm following your github!

I'm having deja vu. Do you post this in every NLP-related topic?

(posting before I commentstalk you to confirm, this is just as much to test my memory as anything else)

edit: maybe not...damn, I swear I've seen this exact phrase in several NLP threads

I wonder if an NLP-driven scraper could confirm your deja-vu... ;)

Nope, my first time asking this.

This is awesome. I looked, but couldn't find out: is there a word sense disambiguation layer somewhere hidden in here?

Curious as to what training algorithms you used for sentiment analysis. Also, can I add my own domain-specific training set?

It uses the Naive Bayes Analyzer from NLTK and the PatternAnalyzer from Pattern [1]. But I would prefer not to limit its use to just one or two algorithms. My recent work involved sentiment analysis of data from US politics, and I got significantly different results when I used other algorithms such as SVM.

[1] https://textblob.readthedocs.org/en/latest/advanced_usage.ht...

This looks great, thanks for sharing.

Any thoughts or relevant benchmarks you would like to share about its speed?

Awesome! Thanks for posting. Are you the hacker that put it together?

Seems to have an incredibly easy interface. Will test it. Well done!

Thanks, I plan on using this.

Awesome. I could use this.

This looks great
