

Ask HN: Is there any HTML table scraper generator in python or else? - jeffjia

Hi,

In one of my projects, I need to get scrapers running for tens of websites to collect the rows and columns of tables (<table>, <ul>, <div>). The tables are well formatted. I have written several scrapers in Python, which basically use a CSS selector and then do some simple transformation with regular expressions. I just wonder whether there is a scraper generator that could take a URL and a sample of the target output as input, and produce a scraper automatically?

Any suggestions are welcome. Thanks in advance.
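(A minimal sketch of the per-site scraper described above, using only the standard library: pull the rows out of a well-formed <table> and clean each cell with a regular expression. In practice a BeautifulSoup CSS selector would replace the hand-rolled parser; the HTML snippet here is made up for illustration.)

```python
import re
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the rows of every <table> in a page."""

    def __init__(self):
        super().__init__()
        self.rows = []         # completed rows
        self._row = None       # row currently being built
        self._in_cell = False  # inside a <td>/<th>?

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            # the "simple transformation with regular expressions":
            # collapse runs of whitespace and trim the cell text
            self._row.append(re.sub(r"\s+", " ", data).strip())

html = ("<table><tr><td> Alice </td><td>42</td></tr>"
        "<tr><td>Bob</td><td>7</td></tr></table>")
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)  # [['Alice', '42'], ['Bob', '7']]
```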
======
tonyfelice
Have you looked at phantomjs?

The webintro example here
([https://github.com/ariya/phantomjs/wiki/Examples](https://github.com/ariya/phantomjs/wiki/Examples))
scrapes a specific element.

~~~
jeffjia
I was using mechanize + Beautiful Soup in Python before, but it seems that
this one also needs a human to read the HTML source code and pick a CSS
selector, instead of automating it...

------
brandonlipman
I would take a look at the Mac app FakeApp. It does a lot of what you are
describing, especially with regard to CSS and XPath selectors. I have been
using it and have been able to do some really great stuff.

------
Johnie
If you don't want to build it yourself, check out import.io. They turn any
website into an API. They did a demo at SV Newtech a couple months ago.

~~~
jeffjia
Thanks Johnie. It is almost what I want, except that it is neither open source
nor free...

------
murtza
Have you taken a look at the Scrapy framework for Python?

[http://scrapy.org/](http://scrapy.org/)

~~~
jeffjia
Thanks. I used Beautiful Soup as the parser, and have actually written a
crawler framework for my scenario. But I was wondering whether there is any
tool that could automate the choice of the CSS selector or XPath.
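(The core of that automation can be sketched in a few lines: given a page and one sample value you want, record the tag path of every element whose text matches the sample, as a crude descendant selector. A real generator would also use classes and ids and generalize over several samples; the HTML and sample here are made up for illustration.)

```python
from html.parser import HTMLParser

class SelectorGuesser(HTMLParser):
    """Guess a CSS descendant selector from one sample target value."""

    def __init__(self, sample):
        super().__init__()
        self.sample = sample
        self.stack = []    # currently open tags, outermost first
        self.matches = []  # candidate selectors

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # if this text node equals the sample, its open-tag path
        # is a candidate selector for values like it
        if data.strip() == self.sample:
            self.matches.append(" > ".join(self.stack))

html = "<html><body><ul><li>foo</li><li>bar</li></ul></body></html>"
g = SelectorGuesser("bar")
g.feed(html)
print(g.matches)  # ['html > body > ul > li']
```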

------
taddeimania
I've used BeautifulSoup to do stuff like this.

~~~
jeffjia
Yeah, me too. The CSS selector is quite convenient. The only problem is that I
need to pick the selector set for each website I want to scrape, and there are
tens of them, which makes the work itself time-consuming...
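(Short of generating the selectors automatically, one way to tame "tens of sites" is to keep the per-site selectors as data, so adding a site is a one-line config change rather than a new scraper. The site names and selectors below are made up for illustration.)

```python
from urllib.parse import urlparse

# one entry per site: where the rows live, and where the cells live
SITE_SELECTORS = {
    "example.com": {"row": "table.data tr", "cell": "td"},
    "example.org": {"row": "ul.items li",   "cell": "span"},
}

def selectors_for(url):
    """Look up the selector set for a URL by its hostname."""
    host = urlparse(url).hostname
    return SITE_SELECTORS.get(host)

print(selectors_for("http://example.org/list"))
# {'row': 'ul.items li', 'cell': 'span'}
```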

