

Show HN: TextBlob, Natural language processing made simple in Python - sloria
https://textblob.readthedocs.org/en/latest/

======
eliben
Yay #1: a nice wrapper around NLTK. NLTK is great but its API is not very
Pythonic or comfortable. Pleasant facades over it are a great help for Python
NLP.

Yay #2: an actually interesting programming-related article on HN. These get
rarer every day, losing their place to gossip about what Snowden remarked
following some or another NSA official's remarks about Snowden's even earlier
remarks.

~~~
eieio
I'm conflicted on this comment.

On the one hand, I agree with your yay #2 (and your yay #1, of course. TextBlob
looks great). I think you're right. I have many venues to discuss NSA issues
and few venues to discuss startup/programming stuff. I like having a venue
that is typically devoted to such stuff.

On the other hand, I'm not sure that saying "this isn't about Snowden" on
tech-related articles that don't involve Snowden is the solution. Why bring
him into the conversation when we're talking about Python?

~~~
visionscaper
>> ... discuss startup/programming stuff. I like having a venue that is
typically devoted to such stuff.

Did you look at [https://lobste.rs/](https://lobste.rs/) ?

I also feel programming content is dwindling on HN, though I can understand
all the talk about Snowden on a forum like HN.

~~~
eieio
As far as I know lobsters is invite only. From what I understand it's a forum
I'd enjoy though.

I'd love an invite if you happen to have one. Contact info in my profile.

~~~
JOfferijns
Sent you an invite!

------
mrkmcknz
Just a quick word on Pattern[1].

TextBlob is probably just using the en module. I would suggest everyone take a
look at the other modules, in particular the web module, should you be doing
any light data scraping. It has nice wrappers around BeautifulSoup and Scrapy,
among others; jumping straight into BeautifulSoup and Scrapy can be daunting
for beginners.

[1]
[http://www.clips.ua.ac.be/pages/pattern](http://www.clips.ua.ac.be/pages/pattern)

------
eterm
I've had good fun playing around with this; it's certainly made NLP more
approachable.

One issue, though, is that it seems to choke on certain characters.

For instance, it complains about the character £ with this error message:

    >>> TextBlob("£")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/eterm/nlp/local/lib/python2.7/site-packages/text/blob.py", line 340, in __repr__
        return unicode("{cls}('{text}')".format(cls=class_name, text=self.raw))
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 10: ordinal not in range(128)
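
The byte 0xc2 and position 10 in that error can be reproduced with a short
sketch. It is written for Python 3, so it only imitates the implicit ASCII
decode that Python 2 performs; the helper name `first_non_ascii` is made up
for illustration and is not part of TextBlob.

```python
def first_non_ascii(data):
    """Return (position, byte) of the first byte ASCII cannot decode, or None."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError as exc:
        return exc.start, data[exc.start]
    return None


# The repr string from the traceback, as UTF-8 bytes (what a Python 2 str holds).
rendered = "TextBlob('\u00a3')".encode("utf-8")

# UTF-8 stores the pound sign as two bytes, 0xc2 0xa3; neither is valid ASCII.
position, byte = first_non_ascii(rendered)
print(position, hex(byte))

# Decoding explicitly as UTF-8 recovers the original text.
print(rendered.decode("utf-8") == "TextBlob('\u00a3')")
```

The ten characters of `TextBlob('` come first, which is why the failure is
reported at position 10.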

~~~
mattdeboard
Yeah, ditto. I created a new virtualenv with Python 3 and those problems
disappeared. Previous to that I hacked around a bit and did the "from
__future__ import unicode_literals" bit which alleviated the issue (but then
ipython had problems with `repr(blob)`). I finally just gave up and ran
`mkvirtualenv textblob --python=python3` (on Ubuntu 12.10).

~~~
eterm
Ah, I'm still using Python 2, which might be causing my problems. For now I'll
just try to work around it by hacking it out of my source data.

(My source data is my own HN comments. It's funny doing sentiment analysis on
them, seeing how objective or subjective it thinks my posts are, as well as
whether I'm generally cheery or miserable.)

(My end game is to produce an HN reader which only shows positive comments and
news to reduce the amount of reading I do. ;))

It gets it mostly right, except for the occasional hiccough. One of these is
the following passage, which stood out as my most subjective post (1.0 on
subjectivity!):

"Factorisation is unique, the addition of 3 primes is not.<p>e.g. 29 can be
written 5 + 11 + 13 or 3 + 3 + 23<p>So even if it were a difficult operation
to reverse addition of 3 numbers, it would be made easier by collisions."

I'm left stumped as to why NLTK thinks this is not only subjective but a
completely subjective 1.0 post!

~~~
mattdeboard
Maybe you're just confusing the facts being stated ("even if it were a
difficult operation to reverse addition of 3 numbers" implies it is easy,
which is true as it's just an oblique restatement of the commutative property)
with the language being used. "Difficult", "easier", "even if", etc.

I have no idea how sentiment analysis works though.
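
One possible mechanism, assuming the analyzer is lexicon-based (an assumption,
not something confirmed in this thread): each known word carries a
subjectivity score, and the text's score is the average over the words found.
The sketch below uses an invented lexicon and invented scores, so it
illustrates the idea rather than what TextBlob, Pattern, or NLTK actually do.

```python
# Toy lexicon: word -> subjectivity in [0, 1]. These scores are invented for
# illustration and are not taken from Pattern or NLTK.
TOY_SUBJECTIVITY = {
    "difficult": 1.0,
    "easier": 1.0,
    "unique": 1.0,
    "great": 0.75,
}


def toy_subjectivity(text):
    """Average the subjectivity of the words that appear in the toy lexicon."""
    words = [w.strip('.,!?()"').lower() for w in text.split()]
    scores = [TOY_SUBJECTIVITY[w] for w in words if w in TOY_SUBJECTIVITY]
    # Words outside the lexicon are ignored, so a few loaded words can dominate.
    return sum(scores) / len(scores) if scores else 0.0


passage = (
    "Factorisation is unique, the addition of 3 primes is not. "
    "So even if it were a difficult operation to reverse addition of "
    "3 numbers, it would be made easier by collisions."
)
print(toy_subjectivity(passage))
```

Under this toy model the passage scores 1.0 because only "unique",
"difficult", and "easier" are in the lexicon; the arithmetic contributes
nothing.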

------
feniv
The NodeBox linguistics module is another nice wrapper around NLTK (and other
natural language processing libraries). I used it for extracting actions and
details from sentences, but it's also great for spelling correction,
pluralization, part-of-speech tagging and other common NLP tasks.

[http://nodebox.net/code/index.php/Linguistics](http://nodebox.net/code/index.php/Linguistics)

~~~
Ihmahr
This thing has been deprecated and continued again under the name Pattern.
[http://www.clips.ua.ac.be/pages/pattern](http://www.clips.ua.ac.be/pages/pattern)

------
Ihmahr
Both for my study and side job I work on NLP with Python.

Sorry, but I think this thing is very much overrated by the HN crowd. There
are many such libraries and this one adds exactly nothing. I also don't see
how this is easier to use than, let's say, Pattern.

Try to add new functionality. One new feature could be to use an ontology to
calculate the distance between two words. Then you can do other cool things
with that and place it in your module.
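
As a rough sketch of what such a distance could look like: count the is-a
edges on the shortest path between two words. The hierarchy below is a
hand-made toy, and `ontology_distance` is a hypothetical helper, not an
existing TextBlob, Pattern, or NLTK function.

```python
from collections import deque

# Hypothetical is-a hierarchy (child -> parent), invented for illustration.
TOY_ONTOLOGY = {
    "dog": "canine",
    "wolf": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def build_graph(is_a):
    """Turn child -> parent links into an undirected adjacency map."""
    graph = {}
    for child, parent in is_a.items():
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set()).add(child)
    return graph


def ontology_distance(word_a, word_b, is_a=TOY_ONTOLOGY):
    """Number of is-a edges on the shortest path, or None if no path exists."""
    graph = build_graph(is_a)
    if word_a not in graph or word_b not in graph:
        return None
    queue = deque([(word_a, 0)])
    seen = {word_a}
    while queue:  # breadth-first search finds the shortest path first
        node, dist = queue.popleft()
        if node == word_b:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


print(ontology_distance("dog", "wolf"))
print(ontology_distance("dog", "cat"))
```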

~~~
mattdeboard
Higher levels of abstraction are less intimidating and easier to get started
with.

~~~
alxndr
And, sometimes, less of a hassle even when you are accustomed to the lower
levels.

------
eieio
This looks great! NLTK is incredible but definitely can be a bit intimidating.
Very cool to have a wrapper around it.

I'm curious to see exactly how it works and so I'll certainly check out the
source when I have a bit more time. Thanks for posting this.

------
the_cat_kittles
If you could add a blob.target and a default vectorizer, you could use
scikit-learn to offer some nice classification and regression. It's pretty
easy to do that with what you have now, but some of those concepts are a
little foreign if you haven't done text classification before, like me before
yesterday. Part-of-speech tagging in particular: using those tags as features
could be powerful alongside n-grams.
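
For anyone new to the idea, here is a minimal sketch of n-gram features using
only the standard library. `ngram_features` is an illustrative helper, not
part of TextBlob or scikit-learn, and whitespace splitting stands in for a
real tokenizer.

```python
from collections import Counter


def ngram_features(text, n_values=(1, 2)):
    """Count word n-grams of each requested size in the text."""
    tokens = text.lower().split()
    features = Counter()
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


features = ngram_features("the cat sat on the mat")
print(features["the"], features["the cat"], len(features))
```

Each distinct n-gram becomes one column of the feature matrix, with its count
as the value; part-of-speech tags could be counted the same way once the
tokens are tagged.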

------
shirkey
For the Google Translate functionality, does this pass the request through an
intermediary service or direct to the API?

------
mattdeboard
So after poking around with this for a bit, I will say that it DEFINITELY is
vulnerable to Python 2's string handling warts. Constructing a `TextBlob` out
of a string with non-ASCII characters doesn't seem to work. I created another
virtualenv with Python 3 and it works quite well.

------
mark_l_watson
I played with this a few days ago. It is a nice wrapper for NLTK. You probably
want to, at some point, read the free NLTK book online.

Edit: it also uses Pattern.

------
sixQuarks
Can someone explain what this does in layman's terms? I'm a biz guy, not a
coder, but I'm interested in the use cases. Thanks.

~~~
sloria
This is an extension to the Python programming language that makes it easier
to analyze and manipulate text.

For example, an analyst might use sentiment analysis to see whether Facebook
posts about a product are "positive" or "negative" in tone.

As another example, I hacked together this online sentiment analyzer using
TextBlob: [https://textfeel.herokuapp.com/](https://textfeel.herokuapp.com/)

See also:

NLP (Wikipedia):
[https://en.wikipedia.org/wiki/Natural_language_processing](https://en.wikipedia.org/wiki/Natural_language_processing)

NLTK (a Python library for NLP): [http://nltk.org/](http://nltk.org/)

Twitter opinion mining using Pattern:
[http://www.clips.ua.ac.be/pages/pattern-examples-elections](http://www.clips.ua.ac.be/pages/pattern-examples-elections)

~~~
sixQuarks
OK, so it's mostly to analyze text that's already been written? Can it also
write natural language text based on data inputs?

~~~
random42
This sounds interesting. Can you give an example use case (what the input
data would be and what the generated natural language would look like)? I
will try to see if I can do it.

~~~
sixQuarks
For example, financial data would be used as input to generate a daily stock
market overview. Something along the lines of:

"Today, the Dow hit a high of 16,200, marking the first time it has crossed
the 16,000 barrier. blah blah blah, etc"

Basically, use data points to create a market overview where readers wouldn't
know that it was computer generated. That's one idea.
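
The simplest version of this is template filling, sketched below. It is
independent of TextBlob; `market_overview` and the previous-record figure are
made up for illustration, and realistic output would need many more templates
and rules.

```python
def market_overview(index_name, high, previous_record):
    """Build one overview sentence from a few integer data points."""
    sentence = "Today, the {} hit a high of {:,}".format(index_name, high)
    # Highest round thousand at or below today's high.
    barrier = (high // 1000) * 1000
    # Only mention the barrier if the old record had not reached it yet.
    if high > previous_record and previous_record < barrier:
        sentence += ", marking the first time it has crossed the {:,} barrier".format(
            barrier
        )
    return sentence + "."


print(market_overview("Dow", 16200, 15990))
print(market_overview("Dow", 16200, 16100))
```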

------
throwawayg99
This is awesome. I looked, but couldn't find out: is there a word sense
disambiguation layer somewhere hidden in here?

------
sumit_psp
I'm curious which training algorithms you used for sentiment analysis. Also,
can I add my own domain-specific training set?

~~~
pavanred
It uses the Naive Bayes analyzer from NLTK and PatternAnalyzer from Pattern
[1]. But I'd prefer not to limit myself to just one or two algorithms. My
recent work involved sentiment analysis of data from US politics, and I got
significantly different results when I used other algorithms such as SVM.

[1]
[https://textblob.readthedocs.org/en/latest/advanced_usage.ht...](https://textblob.readthedocs.org/en/latest/advanced_usage.html#sentiment-analyzers)
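
On the domain-specific training set question, the general idea behind a Naive
Bayes sentiment classifier can be sketched from scratch as below. This is not
NLTK's or TextBlob's implementation; the function names and the four training
sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """examples: iterable of (text, label). Returns the counts used to classify."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # unseen words carry no evidence either way
            # Laplace (add-one) smoothing avoids zero probabilities.
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented domain-specific training data.
TRAIN = [
    ("the bill is a great success", "pos"),
    ("voters love the new policy", "pos"),
    ("the debate was a terrible failure", "neg"),
    ("voters hate the new tax", "neg"),
]
model = train_naive_bayes(TRAIN)
print(classify("a great policy", model))
print(classify("a terrible tax", model))
```

Swapping in a different training set only changes `TRAIN`; swapping in a
different algorithm such as SVM changes `classify`, which is where the
differing results would come from.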

------
dpmehta02
This looks great, thanks for sharing.

Any thoughts or relevant benchmarks you would like to share about its speed?

------
tomrod
Awesome! Thanks for posting. Are you the hacker that put it together?

------
gpsarakis
Seems to have an incredibly easy interface. Will test it. Well done!

------
aswanson
Thanks, I plan on using this.

------
photorized
Awesome. I could use this.

------
misiti3780
This looks great

