

Pattern - Web Mining Python lib - interro
https://github.com/clips/pattern
More info here: http://www.clips.ua.ac.be/pages/pattern
======
languagehacker
Really cool library. I'm excited to take it for a spin! I liked that there was
some work done already for Wikipedia. But as a note to people who want to work
with Wikipedia data, it's not very hard to abstract your stuff to work with
most wikis based on the MediaWiki platform. I've added a pull request to this
project that also supports using the hundreds of thousands of wikis on Wikia.
( <https://github.com/clips/pattern/pull/17> )
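
A minimal sketch of that point, using only the standard library and not Pattern's own API: if the MediaWiki Action API endpoint is a parameter, the same query works against any wiki on the platform. The function name `mediawiki_query_url` and the `wiki.example.org` endpoint are hypothetical, chosen for illustration. The query parameters are the standard Action API ones for fetching page wikitext.

```python
from urllib.parse import urlencode


def mediawiki_query_url(api_base, title):
    """Build an Action API URL that requests the wikitext of one page.

    api_base is the wiki's api.php endpoint; nothing else is wiki-specific.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return api_base + "?" + urlencode(params)


WIKIPEDIA_API = "https://en.wikipedia.org/w/api.php"
EXAMPLE_WIKI_API = "https://wiki.example.org/api.php"  # hypothetical endpoint

# Same function, different wiki: only the base URL changes.
print(mediawiki_query_url(WIKIPEDIA_API, "Data mining"))
print(mediawiki_query_url(EXAMPLE_WIKI_API, "Data mining"))
```

The location of `api.php` varies between installations, so the base URL has to be looked up per wiki rather than guessed.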

------
interro
More info <http://www.clips.ua.ac.be/pages/pattern>

------
knes
I'm a big fan of data mining, so I'll make sure to take it out for a spin :)
And from fellow Belgians, nice!

------
salimmadjd
This is awesome! Any plans to add other sites, like Amazon, Yelp, TripAdvisor,
etc.?

~~~
stevejohnson
NB: screen-scraping Yelp is against the TOS and you'll get shut down pretty
fast if you try it.

~~~
Terretta
Exactly. Screen-scrape Google search results instead; I've heard that works
great. Bet your business model on it, I've heard. ;-)

~~~
bigthingnext
Well, Google "screen scrapes" millions of websites. They bet their business on
it and they seem to be doing OK.

If you upload something to an HTTP server connected to the public internet on
TCP/80, and you don't exclude the path to it in robots.txt, then should anyone
be surprised if it is copied? HTTP clients don't read TOS.
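
For reference, the robots.txt check a well-behaved client makes before fetching can be sketched with Python's standard `urllib.robotparser`. The rules, the `ExampleBot` user agent, and the `example.com` URLs below are made up for illustration. Nothing here describes what Pattern or Google actually does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real client would download it from
# http://<host>/robots.txt instead of hard-coding it.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url, user_agent="ExampleBot"):
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


# A path that is not excluded is fair game as far as the client can tell.
print(may_fetch("http://example.com/public/page.html"))
# A path listed under Disallow is not.
print(may_fetch("http://example.com/private/page.html"))
```

Note that this only covers robots.txt. A site's TOS is not machine-readable, which is the point being made above.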

If Google had to read and interpret every website's TOS, I doubt they
could easily, if at all, produce an index the size of the one they have. It
seems by ignoring a "fear of scraping" they managed to produce something
valuable that the courts seem to side with in spite of offended copyright
holders.

Moreover there's no requirement for them to make their "cache" publicly
accessible. But they do. And again this has held up in court quite well. I
doubt anyone would be surprised that people are using it. Or "scraping" it if
you want to play word games.

------
mkumm
This looks pretty interesting. I will give it a go.

