
Teaching a Computer to Read: NLP Hacking in Python - msalahi
http://blog.scripted.com/staff/nlp-hacking-in-python/
======
drakaal
The CPU cost of this approach is terribly high. I don't think it's going
to give better results than a few simple rules plus NLTK would.

This API will do a better job of telling you what an article is about:
[https://www.mashape.com/stremor/stremor-noun-phrase-and-part-of-speech-tagging-alpha](https://www.mashape.com/stremor/stremor-noun-phrase-and-part-of-speech-tagging-alpha)

That said, the approach we use for our TLDR software and search rankings
doesn't rely on frequency alone: the adjectives that amplify the content, the
sentences with emotion attached to them, and the "charge" of individual words
matter too much to ignore.

Consider the following:

That frakking loser Drakaal came over and hijacked my NLP thread. Just because
he does NLP for a living, and thinks he knows everything doesn't mean a thing.
My NLP is way cooler because it uses machine learning and that is the future
of NLP, not the heuristics model he uses for his stuff.

What is the "core" of that? Clearly it is about how Drakaal sucks, but we only
mention him once. NLP is important, machine learning is important, but really
it is about why Drakaal sucks.
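
Our actual scoring isn't public, but here's a toy sketch of the idea in
Python: amplify raw frequency by an emotional-charge weight. The lexicon
values below are made up for illustration, not our production numbers.

```python
import re
from collections import Counter

# Made-up charge lexicon for illustration only.
CHARGE = {"loser": 3.0, "frakking": 2.5, "hijacked": 2.0, "cooler": 1.5}

def charged_scores(text):
    """Score each word by frequency amplified by its emotional charge."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # Frequency alone undersells charged words; boost by (1 + charge).
    return {w: c * (1 + CHARGE.get(w, 0.0)) for w, c in counts.items()}

text = ("That frakking loser Drakaal came over and hijacked my NLP thread. "
        "My NLP is way cooler because it uses machine learning.")
scores = charged_scores(text)
```

Even though "NLP" appears twice and "loser" only once, the charged word wins,
which is exactly the point of the example.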

~~~
adpreese
I tried to test your example with your API, but it requires a credit card even
for the freemium plan. Is there any way you can offer a rate-limited tier that
never charges, to avoid that? I'm not familiar with Mashape, so it may not be
possible.

~~~
drakaal
Saw you are trying it out. Awesome! Sorry the documentation is a bit weak
right now; people wanted the API, so we shipped it rather than waiting for the
docs to be complete.

~~~
adpreese
I did try it out. It does a good job of pulling out different bits and
categorizing them. I went ahead and ran the example you had and put it up to
continue the conversation
([https://gist.github.com/adpreese/6722561](https://gist.github.com/adpreese/6722561)).
If you want me to take it down, I will certainly respect that, but I thought
it'd be convenient for anyone else paying attention.

The noun-phrases part of the response gave a concise list of things, including
the word "thing" itself (hijacked, NLP thread, stuff, My NLP, thing, Drakaal,
cooler, heuristics, Just). It's maybe good at picking out the nouns, but it's
not really actionable yet. It might be great as the bag of words for trying to
classify something, but by itself the best it's giving me is NLP/heuristics,
and only if I had the concepts grouped together somehow. I think that's a
reasonable takeaway from your example, but I'd be curious what your thoughts
are on it.

PS. I tried the TLDR API on a copy and paste of the original article with
commas, periods, and single and double quotes removed but it returned a 500
error. I'm probably doing something wrong, but I'd love to see what it spits
out if you can help me out.

~~~
drakaal
We have the groupings, and we can use sentiment to get the importance. The API
is somewhat limited compared to our full bag of tricks, mostly because we
don't want to give away all of our secrets, but also because we change things
pretty often and would have to notify anyone consuming an API whenever we made
changes.

------
tdj
Actually, I think you could save yourself some trouble and use scikit-learn's
built-in text preprocessing utils:

Word counter: [sklearn.feature_extraction.text.CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html)

Hashing vectorizer if you want to trade off explainability for speed and
scalability: [sklearn.feature_extraction.text.HashingVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html)

TF-IDF weighting: [sklearn.feature_extraction.text.TfidfVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html)

Also, if you transform bag-of-words vectors into a dense form, you're gonna
have a bad time (insert appropriate meme picture here). In large corpora,
dimensionality grows quite substantially - if you work with news corpora or
Wikipedia, you're in the 100k-1M dimensional space pretty quickly.
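
To see why the hashing trick keeps this tractable, here's a stdlib-only toy
version. (HashingVectorizer does this properly with a stable MurmurHash3;
Python's built-in `hash` is salted per process, so this sketch is only
consistent within a single run.)

```python
import re
from collections import Counter

def hashing_vectorize(text, n_buckets=2**18):
    """Map tokens to a fixed number of buckets via hashing.

    The result stays sparse (a dict of bucket -> count), so memory is
    proportional to the number of distinct tokens, never to n_buckets.
    Densifying this into a 262144-element array per document is where
    the bad time begins.
    """
    counts = Counter()
    for token in re.findall(r"\w+", text.lower()):
        counts[hash(token) % n_buckets] += 1
    return counts

vec = hashing_vectorize("NLP is fun and NLP is fast")
# 7 tokens total, at most 5 distinct buckets occupied.
```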

Great to see an approachable explanation for NLP. As they say sometimes, when
you know how it's done, it stops being "Artificial Intelligence".

------
adpreese
How do you deal with keeping your super rare words list sensible? For many
forms of technical writing I could see things getting out of hand where you
have lots of tiny dense clusters not really close to anything else if you
didn't manage the list well.

~~~
msalahi
As with most successful applications of machine learning, it's about finessing
your approach based on the problem at hand. In our case, we have classes
divided at the level of "Medicine," "Real Estate," etc. So we could throw
away lots of words that occurred only once or twice in the massive corpus we
crawled to build the language model and still have a pretty robust
representation of each subject.

~~~
msalahi
In fact, if your training corpus is sufficiently large, you'd be shocked how
many words you can eliminate right away with a term-frequency cutoff of one or
two. I went from millions of words in the vocabulary to something like 60k
just by ignoring words that occur once or twice in the corpus. Plus, you
probably won't learn much about the relationships between words that only
appear a few times anyway.
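
The cut itself is a one-liner. A minimal sketch (toy data, and `min_freq=3`
is just for illustration; we tuned the real threshold against the corpus):

```python
from collections import Counter

def prune_vocab(tokens, min_freq=3):
    """Keep only words whose total corpus frequency reaches min_freq."""
    counts = Counter(tokens)
    return {w for w, c in counts.items() if c >= min_freq}

tokens = ["nlp"] * 5 + ["medicine"] * 3 + ["escrow"] * 2 + ["hapax"]
vocab = prune_vocab(tokens)
# Words seen only once or twice ("escrow", "hapax") are gone.
```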

~~~
jlees
Yeah, but consider that some rare words are much stronger indicators of topic
than more common ones. Even more so if you look at n-grams. If you use
something like wordnet you can get a lot of meaning out of low-frequency words
and throw away the meaningless higher-frequency ones that occur in too many
categories to be useful.
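
Inverse document frequency captures exactly this: the fewer documents a word
appears in, the higher its weight, and a word present everywhere scores zero.
A minimal sketch with made-up toy documents:

```python
import math

def idf(term, docs):
    """Rarer across documents -> higher weight; everywhere -> zero."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / df) if df else 0.0

docs = [
    {"the", "patient", "ventricle"},   # medicine
    {"the", "escrow", "mortgage"},     # real estate
    {"the", "tensor", "gradient"},     # machine learning
]
# "ventricle" is a strong topic indicator; "the" carries no signal.
```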

~~~
adpreese
Sure, there's value in rare words, but I don't think anything that occurs
fewer than three times across the corpus is going to tell you anything useful.
You need a certain number of occurrences for a word to be a real signal. What
was the least frequent useful word in the data set, msalahi?

------
taariqlewis
Interesting to see a content marketplace company using technology well ahead
of its peers. I wonder how many other firms are pursuing this type of
commercial research.

~~~
rbucks
Great question! Probably not very many.

------
aas48
Genius

