Always fascinated by how diverse the discussion and answers are for HN threads on web-scraping. Goes to show that "web-scraping" has a ton of connotations, everything from automated fetching of URLs via wget or cURL, to data management via something like Scrapy.
Scrapy is a whole framework that may be worthwhile, but if I were just starting out for a specific task, I would use:
- requests http://docs.python-requests.org/en/master/
- lxml http://lxml.de/
- cssselect https://cssselect.readthedocs.io/en/latest/
Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize. But using the web developer tools you can usually figure out the requests made by the browser and then use the Session object in the Requests library to deal with stateful requests:
http://docs.python-requests.org/en/master/user/advanced/
I usually just download pages/data/files as raw files and worry about parsing/collating them later. I try to focus on the HTTP mechanics and, if needed, the HTML parsing, before worrying about data extraction.
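A minimal sketch of how those three fit together, with a Session carrying cookies across requests and the raw response saved to disk before any parsing. The URL, User-Agent string, selector, and file name are made up for illustration:

    import requests
    import lxml.html

    # One Session carries cookies/headers across requests (stateful scraping).
    session = requests.Session()
    session.headers.update({"User-Agent": "my-scraper/0.1"})  # hypothetical UA string

    resp = session.get("https://example.com/stories")  # made-up URL
    resp.raise_for_status()

    # Save the raw payload first; parse/collate it later.
    with open("stories.html", "wb") as f:
        f.write(resp.content)

    # lxml elements grow a .cssselect() method when the cssselect package is installed.
    doc = lxml.html.fromstring(resp.text)
    for link in doc.cssselect("a.story-title"):  # made-up selector
        print(link.text_content(), link.get("href"))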
> Python 3, AFAIK, doesn't have anything as handy as Ruby/Perl's Mechanize. But using the web developer tools you can usually figure out the requests made by the browser and then use the Session object in the Requests library to deal with stateful requests
You could also use the WebOOB (http://weboob.org) framework. It's built on requests + lxml and provides a Browser class usable much like Mechanize's (you can access the parsed document, select HTML forms, etc.).
It also has nice companion features, such as associating URL patterns with custom Page classes that define what data to retrieve when a page matching that pattern is browsed.
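This is not WebOOB's actual API, just a hand-rolled sketch of the idea being described: URL patterns mapped to Page classes that know what to extract. The patterns, class names, and selectors are all invented:

    import re
    import requests
    import lxml.html

    class ArticlePage:  # hypothetical page class
        def __init__(self, doc):
            self.doc = doc

        def extract(self):
            return {"title": self.doc.cssselect("h1")[0].text_content()}

    class ListPage:  # hypothetical page class
        def __init__(self, doc):
            self.doc = doc

        def extract(self):
            return {"links": [a.get("href") for a in self.doc.cssselect("a.item")]}

    # Each URL pattern is associated with the Page class that handles it.
    PAGES = [
        (re.compile(r"/article/\d+$"), ArticlePage),
        (re.compile(r"/list$"), ListPage),
    ]

    def browse(session, url):
        resp = session.get(url)
        resp.raise_for_status()
        doc = lxml.html.fromstring(resp.text)
        for pattern, page_cls in PAGES:
            if pattern.search(url):
                return page_cls(doc).extract()
        raise ValueError("no Page class registered for " + url)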
All great advice. I've written dozens of small purpose-built scrapers and I love your last point.
It's pretty much always a great idea to completely separate the part that performs the HTTP fetches from the part that figures out what those payloads mean.
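One way to structure that separation, as a sketch (the directory name and selector are arbitrary): a fetch phase that only talks HTTP and writes raw bytes to disk, and a parse phase that only reads from disk and can be re-run without touching the network.

    import pathlib
    import requests
    import lxml.html

    RAW_DIR = pathlib.Path("raw_pages")  # arbitrary local cache directory

    def fetch(urls):
        """Phase 1: HTTP only. Store payloads byte-for-byte, no parsing here."""
        RAW_DIR.mkdir(exist_ok=True)
        session = requests.Session()
        for i, url in enumerate(urls):
            resp = session.get(url)
            resp.raise_for_status()
            (RAW_DIR / "{:05d}.html".format(i)).write_bytes(resp.content)

    def parse():
        """Phase 2: parsing only. Re-runnable against the saved files."""
        for path in sorted(RAW_DIR.glob("*.html")):
            doc = lxml.html.fromstring(path.read_bytes())
            yield path.name, doc.cssselect("title")[0].text_content()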
MechanicalSoup seems to be actively updated now, but the last time I tried these libraries they were buggy (and/or I was ignorant), and I just couldn't get things to work the way I was used to in Ruby with Mechanize.
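For comparison, the Mechanize-style workflow in MechanicalSoup looks roughly like this; the URL, form selector, and field names below are made up:

    import mechanicalsoup

    browser = mechanicalsoup.StatefulBrowser()  # keeps cookies between requests
    browser.open("https://example.com/login")   # made-up URL

    # Fill in and submit the login form (selector and field names are hypothetical).
    browser.select_form('form[action="/login"]')
    browser["username"] = "me"
    browser["password"] = "secret"
    browser.submit_selected()

    # The current page is available as a BeautifulSoup document.
    print(browser.page.title.text)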
> Is the modified version you use a personal version or a well-known fork?
I had a specific thing I needed to do, gumbo-parser was a good match, so I poked at it a little and moved on. It started with this[1] commit; then I did some other work locally that was never pushed, because google/gumbo-parser has no owner/maintainer. There are a couple of forks, but little adoption, it seems.