
Pattern: A web mining & natural language processing system for Python - rbreve
http://www.clips.ua.ac.be/pages/pattern
======
tswicegood
This is all really cool stuff, but I can't help but think they've got a
dozen or so separate packages here. They've also re-invented the wheel at
every turn. An all-new wrapper for the Twitter API and every search engine?
An all-new graphing library for JavaScript?

Don't get me wrong, this is all awesome, but a lot of it could have been re-
used from better sources without having to spend time working on random API
wrappers and the like. I would definitely like to know the reasoning behind
creating everything from scratch.

~~~
tomdesmedt
The reason is mainly 1) licensing and 2) integration. I took what I found
that fit the BSD license and wrote the rest myself. The idea was to have
everything I needed in a single, concise package. I like carrying around 1
MacGyver knife instead of a heavy toolbox - even if all the separate tools
in the box are more robust than the knife.

As for the JavaScript graph library: Daniel Friesen had already ported 90% of
the Python code, so it only took me a day or two to finish it. The result is a
single file (graph.js) with lots of things besides visualization (eigenvector
centrality etc.) which seemed better suited to Pattern than integrating
another, bigger project.

Best, Tom

------
stdbrouw
Looks nice, though less full-featured than NLTK. I'd be interested to see
how nicely they'd play together and whether applications could exploit the
strengths of both at the same time. The only thing that's better than one
good NLP framework is two good NLP frameworks, after all.

~~~
baltcode
A recurring pattern with frameworks is that people keep bringing out tools;
some of them disappear, some find niche applications, and some become
mainstream. That said, I haven't heard of many NLP toolkits (but I'm not in
that field).

I want to jump into some basic NLP, but I'd like to stick with one or two
toolkits. I had heard of NLTK before this, but are there any other
comprehensive or at least successful frameworks out there one should be
aware of? (Either in Python or something else.)

~~~
gilesc
The best toolkits are probably in Java:

-Stanford's Tagger, Parser, and NLP Core

-Apache OpenNLP

-Lingpipe

Many smaller components are made to be compatible with IBM UIMA (of Watson
fame), so they can be integrated into a pipeline somewhat easily. For
examples of this in biomedical text mining, see <http://u-compare.org/> .

People will kill me for saying this, but truly: Python's performance isn't
adequate for large-scale text mining, _especially_ if you want to do deep/full
parsing. Shallow parsing as shown in this package's demo is more feasible.
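
To illustrate how cheap shallow parsing can be even in pure Python, here is
a toy sketch (my own example, not Pattern's or NLTK's API): a regex over
part-of-speech tags that groups determiner/adjective/noun runs into
noun-phrase chunks, assuming the input has already been POS-tagged.

```python
import re

def np_chunk(tagged):
    """Very simple shallow parser: map each POS tag to one letter,
    then find determiner/adjective/noun runs with a regex."""
    tags = "".join("D" if t.startswith("DT")
                   else "J" if t.startswith("JJ")
                   else "N" if t.startswith("NN")
                   else "O" for _, t in tagged)
    return [" ".join(w for w, _ in tagged[m.start():m.end()])
            for m in re.finditer(r"D?J*N+", tags)]

tagged = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"),
          ("jumps", "VBZ"), ("over", "IN"),
          ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(np_chunk(tagged))  # ['the quick fox', 'the lazy dog']
```

No recursion, no grammar, no charts - which is exactly why it scales where
a full constituency parse would not.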

I personally find NLTK convoluted, but in its favor, it does have readers for
a TON of corpora, which is really nice.

~~~
devinj
My friends in the natural language field tell me Python and NLTK are more
common than Java. Then again, this is at a sort-of Python-centric university
(Toronto).

------
raufrajar
It seems like a very nice tool, and many hackers will want to play with
it. However, it would be really convenient if the project were put on
GitHub.

------
TuxPirate
The official project page: <http://nodebox.net/code/index.php/Perception>

------
soulclap
I don't know a whole lot about text analysis or the algorithms mentioned.
Can this be used to analyze articles and determine which ones deal with the
same subject, Techmeme-style? What would be a good starting point for this?
(Or would this be better off as an 'Ask HN' post? I am one of those
horrible new people on here.)

~~~
simonb
The: "tf-idf + cosine similarity + LSA metrics" bit from Pattern is what you
are looking for.
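
A rough pure-Python sketch of the tf-idf + cosine part (LSA omitted;
Pattern's vector module wraps all of this for you - the code below is just
an illustrative toy, not its API):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight each tokenized document by term frequency times
    inverse document frequency across the collection."""
    df = Counter(term for doc in docs for term in set(doc))
    n = len(docs)
    return [{t: (c / len(doc)) * math.log(n / df[t])
             for t, c in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "the cat sat on the mat".split(),
    "the cat chased the mouse".split(),
    "stocks fell on wall street today".split(),
]
v = tfidf_vectors(docs)
# The two cat sentences score far higher together than either does
# against the finance sentence - that's the "same subject" signal.
print(cosine(v[0], v[1]), cosine(v[0], v[2]))
```

Cluster articles by pairwise similarity above some threshold and you have a
crude Techmeme.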

~~~
thezilch
In other words, the _vector_ module:
<http://www.clips.ua.ac.be/pages/pattern-vector>

------
syllogism
I'm going to be involved in teaching an NLP course this semester, and we're
debating what to put in it. What are some things you want to do with NLP, and
what would you hope to learn (or have a future employee learn) in an honours
and masters level course?

------
derrida
Imagine spidering twitter for phrases of the type "x is a type of y" in order
to form a database of real world objects in an inheritance hierarchy. Now
imagine when you have these objects, finding out what these objects do by
looking at verbs that occur around them. Boom. You have objects, and you have
the methods you need to write. Now you just need someone to write the code!
Writing the methods could become a sort of captcha exercise.
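
A toy sketch of the extraction step, using a plain regex for the literal
phrasing (entirely hypothetical - real tweets are far noisier and would
need actual parsing):

```python
import re

# Hearst-style pattern for the literal "X is a type/kind of Y" phrasing.
PATTERN = re.compile(
    r"\b(\w+)\s+is\s+a\s+(?:type|kind)\s+of\s+(\w+)",
    re.IGNORECASE)

def hyponym_pairs(text):
    """Return (subtype, supertype) pairs found in the text."""
    return [(x.lower(), y.lower()) for x, y in PATTERN.findall(text)]

tweets = [
    "A trombone is a type of brass instrument",
    "Everyone knows chess is a kind of war",
]
print(hyponym_pairs(" ".join(tweets)))
# [('trombone', 'brass'), ('chess', 'war')]
```

Accumulate enough pairs and you get the inheritance hierarchy; the verbs
near each subject would then suggest the methods.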

~~~
derrida
All of this is to upload the universe of course into some sort of Minecraft
game.

------
phreeza
A project that's been in the back of my head for a while but that I have
no time to do:

Analyze HN comments over time with some NLP techniques, maybe sentiment
analysis. Then if the next wave of "HN is turning into Reddit" posts comes,
point people to the analysis, whatever the conclusions are.

Seems like this would be well suited for the task. Any takers?
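
A crude sketch of what the sentiment pass might look like, with a made-up
word lexicon and toy data standing in for the real comment dump (a serious
run would use a trained sentiment model):

```python
from statistics import mean

# Placeholder lexicons - purely illustrative, not a real word list.
POSITIVE = {"great", "awesome", "insightful", "love"}
NEGATIVE = {"noise", "worse", "decline", "awful"}

def score(comment):
    """Crude polarity: (positive - negative words) / total words."""
    words = comment.lower().split()
    if not words:
        return 0.0
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos - neg) / len(words)

def yearly_sentiment(comments_by_year):
    """Average comment polarity per year, ready for plotting a trend."""
    return {year: mean(score(c) for c in comments)
            for year, comments in comments_by_year.items()}

sample = {
    2010: ["great insightful discussion", "love this community"],
    2011: ["the noise is getting worse", "decline everywhere"],
}
print(yearly_sentiment(sample))
```

Plot the yearly averages and the "is HN declining" question at least gets
a number attached to it.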

~~~
waterside81
Use our API:

<http://www.repustate.com/docs/>

------
ericxtang
Really fantastic piece of software. It's about time we moved away from
using Java wrappers for NLP stuff. Does anyone know of a similar project in
Ruby?

------
logjam
For those of us who have had to roll some of the same functionality
piecemeal out of tools like Stanford CoreNLP, tregex/tsurgeon, WordNet,
Beautiful Soup, and Python's NLTK, this looks on the surface to be pretty
sweet. BSD licensing, too.

Here's a cool application - tagging negation and speculation clauses in some
text (their demo has been trained on biomedical text):

Example sentence: When U937 cells were infected with HIV-1, no induction of
NF-KB factor was detected, whereas high level of progeny virions was produced,
suggesting that this factor was not required for viral replication.

Result: When U937 cells were infected with HIV-1 , [NEG0 no induction of NF-KB
factor was detected NEG0] , whereas high level of progeny virions was produced
, [SPEC2 suggesting that this factor was [NEG1 not required for viral
replication NEG1] SPEC2] .

------
derrida
I actually just literally drooled on the keyboard!

