

Ask HN: Anyone here do Web mining/harvesting/scraping? - Zelgius

I've seen a bunch of software solutions out there. WebQL lets you write SQL-like code to harvest websites. There are also GUI solutions like modenza, fetch and kapow.

Does anyone here work with these kinds of solutions and what do you consider to be the best approach and why?

Thanks!
======
truiu
I have done some web scraping with Twisted and BeautifulSoup, but only for a
few small sites.

That was when I found <http://scrapy.org/>, a complete Python framework for
writing web crawlers. It looks promising, but I haven't used it so far.

------
cjoh
We do a lot of scraping and the best we've found, really, is BeautifulSoup and
being good at Python. There's no UI or easy tool that's going to replace that.
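
For anyone who hasn't seen it, a minimal sketch of the BeautifulSoup approach
might look like the following. The sample HTML, the `story` class name, and the
`extract_stories` helper are all made up for illustration; a real site needs its
own selectors, and the page would normally be fetched over HTTP first.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical markup standing in for a fetched page (an assumption,
# not taken from any real site).
SAMPLE_HTML = """
<ul>
  <li class="story"><a href="/a">First story</a></li>
  <li class="story"><a href="/b">Second story</a></li>
  <li class="ad"><a href="/buy">Buy now</a></li>
</ul>
"""


def extract_stories(html):
    """Return (title, href) pairs for links inside li.story elements."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.select("li.story a[href]")
    ]


if __name__ == "__main__":
    for title, href in extract_stories(SAMPLE_HTML):
        print(title, href)
```

Most of the real work is picking selectors that survive the site's next
redesign, which is where the "being good at Python" part comes in.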

~~~
Zelgius
Actually WebQL is pretty amazing. It scales from a simple program like
select * from links within <http://www.cnn.com> to an advanced 10-page script
that pulls and merges multiple sources. The code is really easy to read and, if
you have any kind of SQL background, very easy to learn. We're just always
keeping our eyes out for something better/cheaper (i.e. free) :)
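
For comparison, here is a rough free-tools take on the "all links within a
page" query, using only the Python standard library. This is not WebQL and
makes no claim to match its semantics; `LinkCollector` and `links_within` are
hypothetical names, and the HTML is passed in as a string rather than fetched.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors without a usable href
                self.links.append(urljoin(self.base_url, href))


def links_within(html, base_url):
    """Return absolute URLs for all links found in the given HTML."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


if __name__ == "__main__":
    page = '<a href="/world">World</a> <a href="http://example.com/x">x</a>'
    print(links_within(page, "http://www.cnn.com"))
```

Merging multiple sources then becomes ordinary list and dict handling on
top of the returned URLs.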

------
lazy_nerd
I would recommend hosted crawling/scraping solutions, where you don't need to
dedicate your own hardware resources. Monitoring and maintaining the
scrapers/harvesters can also be a pain. Shameless plug: I work @ beevolve
(www.beevolve.com) and we provide hosted crawling and scraping solutions. PM
me if you are interested.

------
hellotoby
I use a mixture of cURL and PHP's DOM and SimpleXML functions to scrape what I
need.

------
keefe
I've done it with Java, HttpClient and regular expressions.
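
The same fetch-then-regex idea, sketched in Python rather than Java to keep it
short. The `price` markup, `PRICE_RE`, and `extract_prices` are assumptions for
illustration only, and the HTTP fetch is left out so the snippet runs offline.

```python
import re

# Hypothetical pattern: a dollar amount inside <span class="price">...</span>.
PRICE_RE = re.compile(
    r'<span class="price">\s*\$([0-9]+(?:\.[0-9]{2})?)\s*</span>'
)


def extract_prices(html):
    """Return every matched price in the HTML as a float."""
    return [float(amount) for amount in PRICE_RE.findall(html)]


if __name__ == "__main__":
    sample = '<span class="price">$19.99</span><span class="price"> $5 </span>'
    print(extract_prices(sample))
```

Regular expressions work when the markup is small and stable, but they tend
to break as soon as attributes are reordered or tags get nested, so a real
parser is usually the safer choice for anything larger.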

