
Always fascinated by how diverse the discussion and answers are for HN threads on web-scraping. Goes to show that "web-scraping" has a ton of connotations, covering everything from automated fetching of URLs via wget or cURL to data-pipeline management via something like Scrapy.

Scrapy is a whole framework that may be worthwhile, but if I were just starting out for a specific task, I would use the following (a minimal sketch combining them follows the list):

- requests http://docs.python-requests.org/en/master/

- lxml http://lxml.de/

- cssselect https://cssselect.readthedocs.io/en/latest/
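For a concrete picture, here's a minimal sketch of the three working together; the URL and CSS selector are placeholders of mine, not from any real site:

    # requests fetches, lxml parses; cssselect is what powers doc.cssselect()
    import requests
    import lxml.html

    resp = requests.get("https://example.com/articles")
    resp.raise_for_status()

    doc = lxml.html.fromstring(resp.text)
    for link in doc.cssselect("h2.title a"):
        print(link.text_content().strip(), link.get("href"))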

Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize. But using the web developer tools you can usually figure out the requests made by the browser and then use the Session object in the Requests library to deal with stateful requests:

http://docs.python-requests.org/en/master/user/advanced/
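As a sketch of what that looks like (the login URL and form field names are hypothetical, the kind of thing you'd reconstruct from the dev tools' network tab):

    import requests

    with requests.Session() as s:
        # Cookies set by the login response persist on the session,
        # so later requests on it are authenticated.
        s.post("https://example.com/login",
               data={"username": "me", "password": "secret"})
        resp = s.get("https://example.com/account/export.csv")
        resp.raise_for_status()
        print(resp.text[:200])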

I usually just download pages/data/files as raw files and worry about parsing/collating them later. I try to focus on the HTTP mechanics and, if needed, the HTML parsing, before worrying about data extraction.




> Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize. But using the web developer tools you can usually figure out the requests made by the browser and then use the Session object in the Requests library to deal with stateful requests

You could also use the WebOOB (http://weboob.org) framework. It's built on requests+lxml and provides a Browser class usable much like mechanize's (access to the parsed document, HTML form selection, etc.).

It also has nice companion features, like associating URL patterns with custom Page classes in which you declare what data to retrieve when a page matching that pattern is browsed.
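Roughly, that pattern looks like this (a sketch from my reading of the weboob browser docs; double-check the class and method names against the actual API):

    from weboob.browser import PagesBrowser, URL
    from weboob.browser.pages import HTMLPage

    class ArticlePage(HTMLPage):
        def get_title(self):
            # self.doc is the lxml-parsed document
            return self.doc.xpath("//h1/text()")[0]

    class MyBrowser(PagesBrowser):
        BASEURL = "https://example.com"
        # Pages matching this pattern get wrapped in ArticlePage
        article = URL(r"/articles/(?P<id>\d+)", ArticlePage)

    browser = MyBrowser()
    browser.article.go(id=42)        # fetch; browser.page is now an ArticlePage
    print(browser.page.get_title())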


All great advice. I've written dozens of small purpose-built scrapers and I love your last point.

It's pretty much always a great idea to completely separate the parts that perform the HTTP fetches and the part that figures out what those payloads mean.
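In practice the split can be as simple as two functions that never call each other; the paths, URL, and selector below are illustrative:

    import pathlib
    import requests
    import lxml.html

    RAW_DIR = pathlib.Path("raw_pages")
    RAW_DIR.mkdir(exist_ok=True)

    def fetch(url, name):
        """Stage 1: HTTP only. Save the payload untouched."""
        resp = requests.get(url)
        resp.raise_for_status()
        (RAW_DIR / name).write_bytes(resp.content)

    def parse(name):
        """Stage 2: parsing only. Re-runnable without re-fetching."""
        doc = lxml.html.fromstring((RAW_DIR / name).read_bytes())
        return [a.get("href") for a in doc.cssselect("a")]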


lxml has good XPath support too; the best I've seen. I miss that in some of the scraping options I've tried in other languages.
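For example (document and expressions made up, just to show lxml handling full XPath 1.0 — predicates, functions, axes):

    import lxml.html

    doc = lxml.html.fromstring(
        "<ul><li class='odd'>a</li><li>b</li><li class='odd'>c</li></ul>"
    )
    print(doc.xpath("//li[@class='odd']/text()"))  # ['a', 'c']
    print(doc.xpath("count(//li)"))                # 3.0
    print(doc.xpath("//li[last()]/text()"))        # ['c']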


> Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize.

Did the version of Mechanize written in Py2 stop being supported?


Looks like it's recently been updated, but there's been no big announcement that it's Python 3-ready: https://github.com/python-mechanize/mechanize

I've also seen these alternatives:

- https://robobrowser.readthedocs.io/en/latest/

- https://github.com/MechanicalSoup/MechanicalSoup

MechanicalSoup seems well maintained, but the last time I tried these libraries they were either buggy (and/or I was ignorant), and I just couldn't get things to work the way I was used to with Ruby's Mechanize.
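For anyone curious, here's roughly the Mechanize-style workflow in MechanicalSoup as I understand its current API; the form selector and field names are invented:

    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser()
    browser.open("https://example.com/login")
    browser.select_form('form[action="/login"]')
    browser["username"] = "me"
    browser["password"] = "secret"
    browser.submit_selected()
    # browser.page is the BeautifulSoup doc of the current page
    print(browser.page.title)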


lxml can be hit-or-miss on HTML5 docs. I've had greater success with a modified version of gumbo-parser.


Ah, very cool. I had seen various Python libraries for HTML5 parsing, but not gumbo (or at least I hadn't starred it).

https://github.com/google/gumbo-parser

Is the modified version you use a personal version or a well-known fork?


> Is the modified version you use a personal version or a well-known fork?

I had a specific thing I needed to do, gumbo-parser was a good match, I poked at it a little, and moved on. It started with this[1] commit; then I did some other work locally that I never pushed, since google/gumbo-parser is without an owner/maintainer. There are a couple of forks, but little adoption, it seems.

[1] https://github.com/sebcat/gumbo-parser/commit/c158f8090c2df0...



