
Always fascinated by how diverse the discussion and answers are for HN threads on web-scraping. Goes to show that "web-scraping" has a ton of connotations, covering everything from automated fetching of URLs via wget or cURL to data-pipeline management via something like Scrapy.

Scrapy is a whole framework that may be worthwhile, but if I were just starting out for a specific task, I would use the following (a minimal sketch combining them follows the list):

- requests http://docs.python-requests.org/en/master/

- lxml http://lxml.de/

- cssselect https://cssselect.readthedocs.io/en/latest/
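For a concrete picture, here's a minimal sketch of the three working together; the URL and CSS selector are placeholders of mine, not from any real site:

    # requests fetches, lxml parses; cssselect is what powers doc.cssselect()
    import requests
    import lxml.html

    resp = requests.get("https://example.com/articles")
    resp.raise_for_status()

    doc = lxml.html.fromstring(resp.text)
    for link in doc.cssselect("h2.title a"):
        print(link.text_content().strip(), link.get("href"))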

Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize. But using the web developer tools you can usually figure out the requests made by the browser and then use the Session object in the Requests library to deal with stateful requests:

http://docs.python-requests.org/en/master/user/advanced/
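As a sketch of what that looks like (the login URL and form field names are hypothetical, the kind of thing you'd reconstruct from the dev tools' network tab):

    import requests

    with requests.Session() as s:
        # Cookies set by the login response persist on the session,
        # so later requests on it are authenticated.
        s.post("https://example.com/login",
               data={"username": "me", "password": "secret"})
        resp = s.get("https://example.com/account/export.csv")
        resp.raise_for_status()
        print(resp.text[:200])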

I usually just download pages/data/files as raw files and worry about parsing/collating them later. I try to focus on the HTTP mechanics and, if needed, the HTML parsing, before worrying about data extraction.




> Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize. But using the web developer tools you can usually figure out the requests made by the browser and then use the Session object in the Requests library to deal with stateful requests

You could also use the WebOOB (http://weboob.org) framework. It's built on requests+lxml and provides a Browser class usable much like mechanize's (access to the parsed document, HTML form selection, etc.).

It also has nice companion features, like associating URL patterns with custom Page classes in which you declare what data to retrieve when a page matching that pattern is browsed.
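Roughly, that pattern looks like this (a sketch from my reading of the weboob browser docs; double-check the class and method names against the actual API):

    from weboob.browser import PagesBrowser, URL
    from weboob.browser.pages import HTMLPage

    class ArticlePage(HTMLPage):
        def get_title(self):
            # self.doc is the lxml-parsed document
            return self.doc.xpath("//h1/text()")[0]

    class MyBrowser(PagesBrowser):
        BASEURL = "https://example.com"
        # Pages matching this pattern get wrapped in ArticlePage
        article = URL(r"/articles/(?P<id>\d+)", ArticlePage)

    browser = MyBrowser()
    browser.article.go(id=42)        # fetch; browser.page is now an ArticlePage
    print(browser.page.get_title())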


All great advice. I've written dozens of small purpose-built scrapers and I love your last point.

It's pretty much always a great idea to completely separate the parts that perform the HTTP fetches and the part that figures out what those payloads mean.
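In practice the split can be as simple as two functions that never call each other; the paths, URL, and selector below are illustrative:

    import pathlib
    import requests
    import lxml.html

    RAW_DIR = pathlib.Path("raw_pages")
    RAW_DIR.mkdir(exist_ok=True)

    def fetch(url, name):
        """Stage 1: HTTP only. Save the payload untouched."""
        resp = requests.get(url)
        resp.raise_for_status()
        (RAW_DIR / name).write_bytes(resp.content)

    def parse(name):
        """Stage 2: parsing only. Re-runnable without re-fetching."""
        doc = lxml.html.fromstring((RAW_DIR / name).read_bytes())
        return [a.get("href") for a in doc.cssselect("a")]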


lxml has good XPath support too; the best I've seen. I miss that in some of the scraping options I've tried in other languages.
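For example (document and expressions made up, just to show lxml handling full XPath 1.0 — predicates, functions, axes):

    import lxml.html

    doc = lxml.html.fromstring(
        "<ul><li class='odd'>a</li><li>b</li><li class='odd'>c</li></ul>"
    )
    print(doc.xpath("//li[@class='odd']/text()"))  # ['a', 'c']
    print(doc.xpath("count(//li)"))                # 3.0
    print(doc.xpath("//li[last()]/text()"))        # ['c']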


> Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize.

Did the version of Mechanize written in Py2 stop being supported?


Looks like it's recently been updated, but there's been no big announcement that it's Python 3-ready: https://github.com/python-mechanize/mechanize

I've also seen these alternatives:

- https://robobrowser.readthedocs.io/en/latest/

- https://github.com/MechanicalSoup/MechanicalSoup

MechanicalSoup seems well maintained, but the last time I tried these libraries they were either buggy (and/or I was ignorant), and I just couldn't get things to work the way I was used to with Ruby's Mechanize.
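For anyone curious, here's roughly the Mechanize-style workflow in MechanicalSoup as I understand its current API; the form selector and field names are invented:

    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser()
    browser.open("https://example.com/login")
    browser.select_form('form[action="/login"]')
    browser["username"] = "me"
    browser["password"] = "secret"
    browser.submit_selected()
    # browser.page is the BeautifulSoup doc of the current page
    print(browser.page.title)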


lxml can be hit-or-miss on HTML5 docs. I've had greater success with a modified version of gumbo-parser.


Ah, very cool. I had seen various Python libraries for HTML5 parsing, but not gumbo (or at least I hadn't starred it).

https://github.com/google/gumbo-parser

Is the modified version you use a personal version or a well-known fork?


> Is the modified version you use a personal version or a well-known fork?

I had a specific thing I needed to do, gumbo-parser was a good match, I poked at it a little, and moved on. It started with this[1] commit; then I did some other work locally that I never pushed, since google/gumbo-parser is without an owner/maintainer. There are a couple of forks, but little adoption, it seems.

[1] https://github.com/sebcat/gumbo-parser/commit/c158f8090c2df0...



